tokyo-azure-meetup
Service Fabric Internals, p1
Service Fabric
Cloud Considerations
Don't own the hardware
Failures are part of the game
Scale is unpredictable
Managing services is harder than building them
Advanced telemetry for visibility required
No downtime for upgrades
Do you control your costs? How about density?
Dedicate attention to security
Why Microservices
Evolve continuously
Faster delivery
Build and operate at scale
Application Design - Traditional
Pros
  Compile-time contract validation
  Local operations
  Easier to understand
Cons
  Expensive to scale application
  Hard to scale data access
  Upgrades are difficult
Application Design - Service Oriented
Pros
  Cheaper to scale application
  Easier to scale data access
  Upgrade continuously
Cons
  Runtime contract validation
  Network operations
  Harder to understand
State
Azure Service Fabric
Service Fabric Cluster
Cloud Service vs Service Fabric
Service Fabric Programming Models
Stateless Service Pattern
Stateful Service Pattern
Types of Microservices
Stateless Microservice
  State is stored externally
  We can have N instances
  Web frontends, protocol gateways, Azure Cloud Services
Stateful Microservice
  Maintain hard, authoritative state
  N consistent copies achieved through replication and local persistence
  Database, documents, workflows, user profile, shopping cart
Demo Stateless Service
Dashboard
Deactivate node
Show diagnostics
Enable node
Demo Stateful Service
Migrating a traditional application
Decide on the problems you are solving
  Scale, agility, resilience
Decide on a well-defined area to re-architect
You can have a mixture of traditional and microservice designs
Stages of migrating a traditional application
1) Traditional app
2) Traditional app hosted as a guest executable or container in Service Fabric
3) Traditional app with new microservices added alongside
4) Breaking the traditional app into microservices
5) Transformed into microservices
You can stop at any stage
Common design pattern using gateways
Web Gateway
REST/Websockets
API Management
IoT Hub
Event Hub
Load Balancer
Using a gateway to integrate a traditional app with Service Fabric
Gateway
Client
Client
Problem: Cluster management nightmares
I always worry about running out of capacity.
I am not sure if all the VM resources are utilized.
I am worried sick about my cluster being compromised.
I have no control over when a new Service Fabric version is rolled out to my cluster.
I am not sure what disasters my cluster can survive.
Best Practice: Service Fabric cluster management nightmare mitigation
Let us divide the problem space into three buckets
Plan out your cluster capacity
Optimize and Secure your cluster
Manage your cluster version
Best Practice: Service Fabric Cluster planning
Capacity planning is not an easy exercise.
Capacity planning is not a one-time exercise.
Do not assume that you can add capacity on demand instantly.
Do not assume that you can take downtime to change capacity later.
Here are some of the questions you will need to answer before you can truly plan out your cluster.
Service Fabric Cluster planning
[Diagram: Source Control, Build, Inner Dev Loop, DEV, Test, PPL, PROD, OPS]
What is this cluster to be used for?
  Is this to be used for Test?
  Is this a part of the CI/CD pipeline?
  Is this for Production use?
Where do you want this cluster hosted?
  On Azure?
  On-premises, in your data center?
  On some other cloud provider?
Are there unique compliance and security requirements?
  End-to-end RBAC and auditing? Certificates OK? Active Directory OK?
  Compliance expectations from the infrastructure?
  Compliance goals on the application?
Service Fabric Cluster planning
What kinds of workloads are planned to be deployed to it?
For each application
  Total state
  # of instances
  Replica set size
  Port requirements per service
  IOPS needed
  External state vs state in the Service Fabric cluster
  Growth rate
How many node types (what kinds of apps are to be deployed)?
Are there non-SF services to be run as well?
Service Fabric Cluster planning
Once you know what each application needs, focus on the characteristics of each node type.
  CPU
  RAM
  Disk (total state of the replicas you want to host)
  State durability (Gold vs Silver)
  Reliability (applies only to primary node type)
  Fault tolerance: # of FDs and # of UDs
Choosing the # of FDs
  This determines the headroom needed in case of unplanned failures.
Choosing the # of UDs
  This determines the headroom needed in case of planned failures.
Choosing the # of Fault Domains you need
[Diagram: replicas spread across FD1 to FD5; a PDU burnout takes out one FD]
Number of Fault Domains determines the headroom needed in case of unplanned failures.
Examples could be a PDU failing or TOR maintenance, which will typically take out all machines in a rack.
In terms of capacity, you need to leave enough headroom to accommodate the failure of at least one FD.
This will result in SF moving/creating new replicas on the available machines in other FDs.
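The headroom rule above can be turned into simple arithmetic. The following is a minimal sketch, assuming nodes and load are spread evenly across fault domains; the function names are hypothetical and are not part of any Service Fabric API.

```python
def max_safe_utilization(fault_domains: int, domains_lost: int = 1) -> float:
    """Fraction of total capacity that may be in use while the load still
    fits on the surviving fault domains after `domains_lost` of them fail.

    Assumption: nodes and load are spread evenly across fault domains.
    """
    if fault_domains <= domains_lost:
        raise ValueError("need more fault domains than the number lost")
    return (fault_domains - domains_lost) / fault_domains


def nodes_needed(load_in_nodes: int, fault_domains: int) -> int:
    """Smallest node count, in multiples of `fault_domains`, such that a load
    worth `load_in_nodes` nodes still fits after one whole FD is lost."""
    per_fd = 1
    while per_fd * (fault_domains - 1) < load_in_nodes:
        per_fd += 1
    return per_fd * fault_domains


if __name__ == "__main__":
    # With 5 FDs, keep utilization at or below 80% to survive losing one FD.
    print(max_safe_utilization(5))
    # A load that needs 8 nodes requires 10 nodes across 5 FDs.
    print(nodes_needed(8, 5))
```

Fewer fault domains mean a larger share of capacity must stay idle: with 3 FDs the same rule allows only about 67% utilization.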
Choosing the # of Upgrade Domains you need
[Diagram: replicas spread across FD1 to FD5 and UD1 to UD10; an SF upgrade takes one UD down]
Number of Upgrade Domains determines the headroom needed in case of planned failures/downtimes.
Examples could be a Service Fabric upgrade going on, during which a UD is down.
You have to have room to place additional replicas if need be.
Best practice: capacity headroom
[Diagram: nodes laid out across FD1 to FD5 and UD1 to UD10]
You should plan your capacity in such a way that your service can survive at least:
  A loss of one FD
  A UD being down because of an upgrade going on
  A random node/VM failing additionally
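To see what surviving all three events at once costs, one can brute-force the worst case over a node layout. This is a sketch under stated assumptions: each node is described by a hypothetical (FD, UD) pair, and the example layout of one node per FD/UD cell is illustrative only, not how any particular cluster is actually laid out.

```python
from itertools import product


def worst_case_survivors(nodes):
    """nodes: list of (fd, ud) tuples, one per node.

    Returns the minimum number of nodes still up when one whole FD is lost,
    one whole UD is down for an upgrade, and one more node fails.
    """
    fds = {fd for fd, _ in nodes}
    uds = {ud for _, ud in nodes}
    worst = len(nodes)
    for lost_fd, down_ud in product(fds, uds):
        up = sum(1 for fd, ud in nodes if fd != lost_fd and ud != down_ud)
        # One additional random node failure, if any node is still up.
        up = max(up - 1, 0)
        worst = min(worst, up)
    return worst


def has_headroom(nodes, nodes_required):
    """True if the workload still fits in the worst case."""
    return worst_case_survivors(nodes) >= nodes_required


# Illustrative layout: one node in every (FD, UD) cell of a 5 x 10 grid.
grid = [(fd, ud) for fd in range(5) for ud in range(10)]

if __name__ == "__main__":
    print(worst_case_survivors(grid))  # 50 - 10 (FD) - 4 (rest of UD) - 1 = 35
    print(has_headroom(grid, nodes_required=30))
```

In this layout, 15 of 50 nodes can be unavailable at once, so the workload should fit on 35 nodes.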
Best practices: Cluster Set up
Use ARM templates to customize your cluster
  Spread VMs across multiple storage accounts
  Fan out the IO
  Protection against widespread outage
Use ARM templates to drive changes to your Resource Group
  Easy configuration management
  Auditing
  Avoid using implicit commands to tweak your resources.
  Be very pedantic about the configurations you deploy to your production environment
Best practices: Cluster Set up
Use a separate node type to host system services for large clusters.
[Diagram: nodes laid out across FD1 to FD5 and UD1 to UD10. Legend: 10 NT1 nodes, 20 NT2 nodes, 7 SF nodes]
Best practices: Cluster Security
Always use a secure cluster to deploy anything you care about
Additionally consider the following
Create DMZs using NSGs
Use Jump boxes to manage your cluster
2014 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION. 2/26/2017 4:37 PM
Best practices: Cluster Security
[Diagram: Service Fabric cluster inside a VNET with load balancers LB#1 to LB#3, NSG#1 and NSG#2, VM scale sets, and a jump box; Key Vault and AAD for security; Azure Storage for diagnostics, SF logs, and VHDs]
NSG ports that need to be opened
ClientConnectionEndpoint (TCP): 19000
HttpGatewayEndpoint (HTTP/TCP): 19080
SMB 445 and 135
ClusterConnectionEndpointPort (TCP): 9025
LeaseDriverEndpointPort (TCP): 9026
Ephemeral port range: minimum 256 ports
App ports as needed.
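A port list like this is easy to check mechanically before a rule set is deployed. The sketch below is plain Python and does not call any Azure API; the port numbers come from the list above, while `allowed_ranges` (the inclusive port ranges your NSG rules permit) and the function names are hypothetical.

```python
# Ports taken from the list above.
REQUIRED_TCP_PORTS = {
    "ClientConnectionEndpoint": 19000,
    "HttpGatewayEndpoint": 19080,
    "port 445": 445,  # listed on the slide as "SMB 445 and 135"
    "port 135": 135,
    "ClusterConnectionEndpointPort": 9025,
    "LeaseDriverEndpointPort": 9026,
}
MIN_EPHEMERAL_PORTS = 256


def missing_ports(allowed_ranges, required=REQUIRED_TCP_PORTS):
    """allowed_ranges: list of inclusive (low, high) port ranges.

    Returns the sorted names of required endpoints that no range covers.
    """
    def allowed(port):
        return any(low <= port <= high for low, high in allowed_ranges)

    return sorted(name for name, port in required.items() if not allowed(port))


def ephemeral_range_ok(start, end, minimum=MIN_EPHEMERAL_PORTS):
    """True if the inclusive range [start, end] holds at least `minimum` ports."""
    return end - start + 1 >= minimum


if __name__ == "__main__":
    rules = [(19000, 19000), (19080, 19080), (9025, 9026)]
    print(missing_ports(rules))              # the two SMB-related ports
    print(ephemeral_range_ok(49152, 49407))  # exactly 256 ports
```

Application ports would be added to `required` per service.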
Manage your Cluster Version
Ability to select a supported Fabric version
  Set the upgrade mode to Automatic or Manual
  Select the specific Fabric version via APIs or the Portal
  You can switch between Automatic and Manual
You have 60 days to adopt the new version
A warning is generated 14 days prior to your cluster going out of support
New versions are announced on the team blog
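For planning a manual upgrade, the two deadlines can be derived from a release date. This is a rough sketch that assumes the 60-day window is counted from the day the new version is released; the slide does not say exactly when the clock starts, so treat that as an assumption. The function name is hypothetical.

```python
from datetime import date, timedelta

ADOPTION_WINDOW = timedelta(days=60)  # "60 days to adopt the new version"
WARNING_LEAD = timedelta(days=14)     # warning 14 days before going out of support


def support_dates(new_version_released: date):
    """Return (warning_date, out_of_support_date) for a cluster that stays
    on the old version. Assumes the window starts on the release date."""
    out_of_support = new_version_released + ADOPTION_WINDOW
    warning = out_of_support - WARNING_LEAD
    return warning, out_of_support


if __name__ == "__main__":
    warning, deadline = support_dates(date(2017, 1, 1))
    print("warning expected on:", warning)
    print("out of support on:  ", deadline)
```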
Cluster Fabric Upgrade
Factors to consider when choosing the upgrade mode
  Availability of your service
  Need for predictability of performance
  Freedom of choice to select the velocity
  Support considerations
Recommendation of upgrade mode for dev, test, PPL, prod
[Diagram: Source Control, Build, Inner Dev Loop, DEV, Test, PPL, PROD, OPS]
Debugging in Production
Don't debug in production
  Difficult to catch an issue directly
  Security and compliance concerns
  Should debug tools be installed on all production nodes?
Instrument your code
  Instrumenting your code is critical for debugging based on logs
  Should be able to trace the execution path through your code
  Always log in UTC
  Flow an activity id
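The last two points can be illustrated with Python's standard `logging` module. This is a minimal sketch, not a Service Fabric API: the logger name, the `handle_request` function, and the log format are all hypothetical. It stamps every record in UTC and attaches an activity id carried in a context variable.

```python
import contextvars
import logging
import time
import uuid

# Carries the current activity id across function calls in one request.
activity_id = contextvars.ContextVar("activity_id", default="-")


class ActivityIdFilter(logging.Filter):
    """Copies the current activity id onto every log record."""

    def filter(self, record):
        record.activity_id = activity_id.get()
        return True


def build_logger(name="svc", stream=None):
    handler = logging.StreamHandler(stream)
    formatter = logging.Formatter(
        "%(asctime)sZ %(levelname)s [%(activity_id)s] %(name)s: %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S",
    )
    formatter.converter = time.gmtime  # always log in UTC
    handler.setFormatter(formatter)
    handler.addFilter(ActivityIdFilter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_activity_id=None):
    """Reuse the caller's activity id if one was passed, else start a new one."""
    token = activity_id.set(incoming_activity_id or uuid.uuid4().hex)
    try:
        logger.info("request started")
        logger.info("request finished")
        return activity_id.get()
    finally:
        activity_id.reset(token)


if __name__ == "__main__":
    handle_request(build_logger(), incoming_activity_id="abc123")
```

Passing the same id to downstream services lets one request be traced across their logs.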
2016 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION. 2/26/2017
Data Loss
Recovery Point Objective (RPO)
  How much data, in minutes, can the business afford to lose?
  Business should set the RPO; a smaller RPO is more expensive
Each service must expect and plan for data loss
Soft deletes (tombstoning) are best practice
  Hard delete later when you know it is not needed
Data Corruption
Frequently caused by a software bug (or a hacker)
Detecting corruption is a hard, domain-specific issue
If needed, deal with corruption using journaling, snapshots/backups
Make sure you test restoring from corruption
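Soft deletes are simple to sketch. The in-memory store below is a hypothetical illustration, assuming a fixed retention period before hard deletion; a real service would keep the tombstones in its own durable state.

```python
from datetime import datetime, timedelta, timezone


class SoftDeleteStore:
    """Key/value store where delete() only writes a tombstone."""

    def __init__(self):
        self._items = {}       # key -> value
        self._tombstones = {}  # key -> UTC time of the soft delete

    def put(self, key, value):
        self._items[key] = value
        self._tombstones.pop(key, None)

    def get(self, key):
        if key in self._tombstones:
            return None  # soft-deleted items are hidden from readers
        return self._items.get(key)

    def delete(self, key, now=None):
        """Soft delete: mark the item, keep the data."""
        if key in self._items:
            self._tombstones[key] = now or datetime.now(timezone.utc)

    def restore(self, key):
        """Undo a soft delete. Returns False if there was nothing to restore."""
        return self._tombstones.pop(key, None) is not None

    def purge(self, retention, now=None):
        """Hard delete items whose tombstone is older than `retention`."""
        now = now or datetime.now(timezone.utc)
        expired = [k for k, t in self._tombstones.items() if now - t >= retention]
        for key in expired:
            del self._items[key]
            del self._tombstones[key]
        return expired


if __name__ == "__main__":
    store = SoftDeleteStore()
    store.put("cart:42", ["book"])
    store.delete("cart:42")
    print(store.get("cart:42"))      # hidden, but still recoverable
    print(store.restore("cart:42"))  # mistake undone
    print(store.get("cart:42"))
```

Until `purge` runs, an accidental or malicious delete can be undone without restoring from backup.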
Availability and Reliability: Active/Passive
[Diagram: Azure Traffic Manager in front of Cluster A (Primary) and Cluster B (Secondary), five nodes each, with replication traffic between the clusters]
Two similar clusters
Only Cluster A takes traffic
Primary must handle spikes
Data replicated to Cluster B in the background
Availability and Reliability: Active/Passive
Failover flow
  Customer experiences issue
  DevOps decides to fail over
  Data inconsistency/loss
  RPO == replication delay
  Takes minutes
Simple development
Infrequently tested
Wasted capacity
Availability and Reliability: Active/Active
[Diagram: Azure Traffic Manager in front of Cluster A and Cluster B, five nodes each, with replication traffic between the clusters]
Two similar clusters
Both clusters take traffic
Both clusters handle spikes
Less expensive
Data replicated to the other cluster in the background
Availability and Reliability: Active/Active
Failover is fast and free
Harder development
Data inconsistency or dual reads
Continuously tested
Less wasted capacity
Availability and Reliability: Real Example
[Diagram: Azure Traffic Manager in front of Cluster A and Cluster B (five nodes each) in West US and East US; Azure Storage A (RA-GRS) and Azure Storage B (RA-GRS), each with a read-only copy, kept in sync by replication]
Two regionally separated DCs
Can read from or write to either storage (RA-GRS), but default is local DC
Cascading Failures
One simple failure leads to system-wide failure
Plan for failure and understand the impact of failure on the system and its SLA
When a service fails, clients must retry continuously, causing a traffic storm
Can occur across regions; active-active cross-region deployments are not immune
Look at using Circuit Breaker patterns
Retry using exponential back-off with a maximum interval
Once the connection is reestablished, reset the back-off interval
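The last three points fit together in a few lines. The sketch below is one possible shape, not a library API: the class names, thresholds, and the choice of `ConnectionError` as the retryable failure are assumptions made for illustration.

```python
import time


class Backoff:
    """Exponential back-off capped at a maximum interval."""

    def __init__(self, base=0.5, factor=2.0, max_interval=30.0):
        self.base, self.factor, self.max_interval = base, factor, max_interval
        self.reset()

    def next_delay(self):
        delay = self._next
        self._next = min(self._next * self.factor, self.max_interval)
        return delay

    def reset(self):
        """Call once the connection is reestablished."""
        self._next = min(self.base, self.max_interval)


class CircuitBreaker:
    """Stops calls after repeated failures, then allows a probe after a pause."""

    def __init__(self, failure_threshold=3, open_seconds=10.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self._failures = 0
        self._opened_at = None

    def allow(self, now):
        if self._opened_at is None:
            return True
        return now - self._opened_at >= self.open_seconds  # half-open probe

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self, now):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = now


def call_with_retry(operation, backoff, breaker, max_attempts=5,
                    sleep=time.sleep, clock=time.monotonic):
    for attempt in range(max_attempts):
        if breaker.allow(clock()):
            try:
                result = operation()
            except ConnectionError:
                breaker.record_failure(clock())
            else:
                breaker.record_success()
                backoff.reset()  # connection is back: start small again
                return result
        if attempt < max_attempts - 1:
            sleep(backoff.next_delay())
    raise ConnectionError("gave up after %d attempts" % max_attempts)
```

The cap keeps recovery time bounded, and the breaker keeps a failing dependency from being hammered while it is down. Adding random jitter to the delay is a common extension that is not shown here.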
Humans cause most problems
Human error causes 60% to 80% of service outages
Treat operational procedures like code
Automate as much as feasible
  Manual procedures must be one-off processes
  Humans are slower than automation
  If you can document a manual procedure, why can't it be automated?
  Validate and test automation
Automate certificate and key rotation
  Always have two certificates/keys and ensure that one is always valid
  Rotate regularly
  Caused Azure outage in 2013
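The rule that one certificate is always valid can be checked automatically from the validity windows alone. This is a small sketch with hypothetical names; it works on plain (not_before, not_after) pairs and does not parse real certificates.

```python
from datetime import datetime


def coverage_gaps(certs, start, end):
    """certs: list of (not_before, not_after) validity windows.

    Returns a list of (gap_start, gap_end) periods within [start, end]
    during which no certificate is valid. An empty list means the
    rotation plan always leaves at least one valid certificate.
    """
    gaps = []
    cursor = start
    for not_before, not_after in sorted(certs):
        if not_before > cursor:
            gaps.append((cursor, min(not_before, end)))
        cursor = max(cursor, not_after)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


if __name__ == "__main__":
    primary = (datetime(2017, 1, 1), datetime(2017, 7, 1))
    secondary = (datetime(2017, 6, 1), datetime(2017, 12, 31))
    # Overlapping windows: no gap during the year.
    print(coverage_gaps([primary, secondary],
                        datetime(2017, 1, 1), datetime(2017, 12, 31)))
```

Running such a check in the deployment pipeline turns an expiring certificate into a failed build instead of an outage.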
Thank you!