AT&T Proprietary (Internal Use Only)
Not for use or disclosure outside the AT&T companies except under written agreement
How to Manage Data Collection in a Large Environment
Paul K Merline & Mike Badaczewski
November 15, 2011
Page 2
Which is greater: the average attendance at Busch Stadium or the number of servers we collect data on every night?
Page 3
Answer:
AT&T: systems collected nightly = 38,353
Busch Stadium: average nightly attendance = 38,196 (source: ESPN.com, as of 10-18-11)
Page 4
Data Collection Goals
Provide consistent, standard and meaningful resource usage data for all servers to support Capacity Planning.
Establish and maintain an environment capable of supporting data collection for 40,000 servers with the existing staff.
Have previous day’s data available by 08:00 local time.
Page 5
Data Collection Overview
Number of metrics collected and retention based upon criticality of server (service levels).
Separate data collection based on platform, e.g. UNIX, Windows, etc.
Spread workload across several centralized data collection servers (Consoles).
Stagger data collection across time zones.
Analyzed data output is sent to the database server for Visualizer database loads.
Data Collection Strategy
Tier and Tier Level assigned based on:
• server criticality (MCA, normal production)
• status (production, test, development)
• in-service indicator
Service Level assigned based on Tier and Tier Level, which determines:
• metrics collected
• retention period of metrics
Collect and retain only the metrics necessary based on the criticality of the server.
Page 6
Page 7
Tier Mapping to Service Levels
Tier | Tier Level | Tier Description | Service Level
1 | 9 | Systems requiring very complex analysis | Platinum
1 | 0 | VMWARE hosts & SUN global zones | Gold
1 | 1 | Production servers running company designated Mission Critical Applications | Gold
1 | 3 | Servers running Non Mission Critical application requiring Gold metrics per a Capacity Planner | Gold
2 | 1 | Server running a production application not specified Mission Critical | Silver
2 | 6 | Test & Development servers requiring Silver metrics per a Capacity Planner | Silver
2 | 2 | Server used for non-production test & QA | Bronze
2 | 4 | Server used for development work | Bronze
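The tier mapping above amounts to a lookup from (Tier, Tier Level) to Service Level. A minimal sketch in Python follows; the dictionary is transcribed from the table, while the names `TIER_TO_SERVICE_LEVEL` and `service_level` are illustrative assumptions and not part of the AT&T tooling.

```python
# (Tier, Tier Level) -> Service Level, transcribed from the tier mapping table.
# The names used here are hypothetical, chosen only for illustration.
TIER_TO_SERVICE_LEVEL = {
    (1, 9): "Platinum",
    (1, 0): "Gold",
    (1, 1): "Gold",
    (1, 3): "Gold",
    (2, 1): "Silver",
    (2, 6): "Silver",
    (2, 2): "Bronze",
    (2, 4): "Bronze",
}


def service_level(tier: int, tier_level: int) -> str:
    """Return the service level for a (tier, tier level) pair."""
    try:
        return TIER_TO_SERVICE_LEVEL[(tier, tier_level)]
    except KeyError:
        # Unknown combinations are rejected rather than defaulted (an assumption).
        raise ValueError(
            f"no service level defined for tier {tier}-{tier_level}"
        ) from None


if __name__ == "__main__":
    print(service_level(1, 1))
    print(service_level(2, 4))
```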
Page 8
Data Collection Service Levels
PLATINUM | GOLD | SILVER | BRONZE
Tier 1 | Tier 1 | Tier 2-1 | Tier 2-2 or 2-4
MCA requiring extra care | MCA or on frame w/ MCA | Non-MCA production | Test and development
Custom workload char | Custom workload char | Standard workload char | No workload char
35 UNIX metrics | 35 UNIX metrics | 21 UNIX metrics | 10 UNIX metrics
15 min. data summary | 60 min. data summary | 60 min. data summary | 60 min. data summary
6 mos. data retention | 6 mos. data retention | 1 mo. data retention | 7 day data retention
200 UNIX | 7,400 UNIX, 400 Win | 6,000 UNIX, 14,000 Win | 4,000 UNIX, 3,000 Win
Data Collection Process
Servers are grouped into collection domains based on:
• Platform: UNIX, Windows, VMWare
• Region: East, Central, Mountain, Pacific, Alaska, Hawaii
• Service Level: Gold, Silver, Bronze
• Frame: Frames, Non-Frames
(target is 25 servers per domain for performance reasons)
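One way to picture the domain grouping is to bucket servers by platform, region, service level and frame status, then cut each bucket into chunks of at most 25. The sketch below is only an illustration under those assumptions: the `Server` record, its field names, and the chunking rule are hypothetical, and the slides do not describe how domains are actually assigned.

```python
from collections import defaultdict
from typing import NamedTuple

TARGET_DOMAIN_SIZE = 25  # target from the slide: 25 servers per domain


class Server(NamedTuple):
    # Hypothetical record layout, for illustration only.
    name: str
    platform: str       # e.g. "UNIX", "Windows", "VMWare"
    region: str         # e.g. "East", "Central", "Pacific"
    service_level: str  # e.g. "Gold", "Silver", "Bronze"
    on_frame: bool


def build_domains(servers):
    """Group servers by shared attributes, then split into chunks of <= 25."""
    groups = defaultdict(list)
    for s in servers:
        key = (s.platform, s.region, s.service_level, s.on_frame)
        groups[key].append(s.name)

    domains = []
    for key in sorted(groups):
        names = sorted(groups[key])
        for i in range(0, len(names), TARGET_DOMAIN_SIZE):
            domains.append((key, names[i:i + TARGET_DOMAIN_SIZE]))
    return domains


if __name__ == "__main__":
    fleet = [Server(f"ux{i:03d}", "UNIX", "East", "Gold", False) for i in range(60)]
    for key, members in build_domains(fleet):
        print(key, len(members))
```

With 60 identical-attribute servers, this yields three domains of 25, 25 and 10 servers.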
Page 9
Page 10
UNIX Metric Groups
UNIX BRONZE METRICS
• Collector Information
• Network Statistics
• Physical Partition Configuration
• CPU Configuration
• System Configuration
• PRM Configuration
• CPU Statistics
• System Statistics
• WLM Configuration
• File System Statistics
ADDITIONAL UNIX SILVER METRICS
• Disk Configuration
• NFS Statistics
• VMware ESX Host System Configuration
• Disk Statistics
• NFS3 Statistics
• VMware ESX Server Configuration
• Logical Volume Configuration
• Process Statistics
• VMware ESX CPU Statistics
• Network Configuration
• User Id Statistics
• VMware ESX Memory Statistics
• Network Interface Statistics
• Pool Configuration
• VMware ESX Disk Statistics
• Zone Configuration
• VMware ESX Network Statistics
• VMware ESX Per CPU Statistics
ADDITIONAL UNIX GOLD METRICS
• FSS Configuration
• RPC Client Statistics
• Scheduling Policy Info
• Swap Statistics
• RPC Server Statistics
• SRM Statistics
• Swap Device Statistics
The groupings are cumulative. The Silver service level will collect all the metric groups listed under the Bronze and Silver metric groups. The Gold service level will collect all the metrics listed on this page. Each metric group can generate many measurements and data fields. For example, the System Statistics metric group contains 189 data fields.
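The cumulative rule can be sketched as building each level's list from the level below it. The group names are copied from this page; the function name `metric_groups` and the data layout are assumptions made for illustration.

```python
# Metric groups as listed on this page; each level adds to the one below it.
BRONZE = [
    "Collector Information", "Network Statistics",
    "Physical Partition Configuration", "CPU Configuration",
    "System Configuration", "PRM Configuration", "CPU Statistics",
    "System Statistics", "WLM Configuration", "File System Statistics",
]
SILVER_ADDITIONAL = [
    "Disk Configuration", "NFS Statistics",
    "VMware ESX Host System Configuration", "Disk Statistics",
    "NFS3 Statistics", "VMware ESX Server Configuration",
    "Logical Volume Configuration", "Process Statistics",
    "VMware ESX CPU Statistics", "Network Configuration",
    "User Id Statistics", "VMware ESX Memory Statistics",
    "Network Interface Statistics", "Pool Configuration",
    "VMware ESX Disk Statistics", "Zone Configuration",
    "VMware ESX Network Statistics", "VMware ESX Per CPU Statistics",
]
GOLD_ADDITIONAL = [
    "FSS Configuration", "RPC Client Statistics", "Scheduling Policy Info",
    "Swap Statistics", "RPC Server Statistics", "SRM Statistics",
    "Swap Device Statistics",
]

# Ordered from least to most metrics; a level collects its own additions
# plus everything from the levels before it.
LEVELS = [
    ("Bronze", BRONZE),
    ("Silver", SILVER_ADDITIONAL),
    ("Gold", GOLD_ADDITIONAL),
]


def metric_groups(service_level: str) -> list:
    """Return the cumulative list of metric groups for a service level."""
    collected = []
    for name, additions in LEVELS:
        collected.extend(additions)
        if name == service_level:
            return collected
    raise ValueError(f"unknown service level: {service_level}")


if __name__ == "__main__":
    print("Swap Statistics" in metric_groups("Silver"))
    print("Swap Statistics" in metric_groups("Gold"))
    print(len(metric_groups("Bronze")), len(metric_groups("Gold")))
```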
Page 11
Data Retentions
Visualizer Data Retention (Data Retention in Days)
Service Level | Data Summary Interval (Min) | Global Data | Process Detail | Disk Detail
Platinum | 15 | 180 | 30 | 30
Gold | 60 | 180 | 30 | 30
Silver | 60 | 30 | 7 | 7
Bronze | 60 | 7 | 7 | 7
UDR Data Retention
Service Level | Collection Data Interval (Min) | Uncompressed (Days) | Compressed (Days)
Platinum | 15 | 2 | 30
Gold | 15 | 2 | 30
Silver | 15 | 2 | 14
Bronze | 60 | 2 | 0
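As an illustration of how such retention periods might be applied, the sketch below checks whether a day of Visualizer data is past its retention window. The day counts come from the Visualizer table on this page; the function name, the data-type keys, and the "strictly older than N days" purge rule are assumptions, since the slides do not describe the purge mechanism.

```python
from datetime import date

# Visualizer retention in days, by service level and data type (from the table).
VISUALIZER_RETENTION_DAYS = {
    "Platinum": {"global": 180, "process": 30, "disk": 30},
    "Gold":     {"global": 180, "process": 30, "disk": 30},
    "Silver":   {"global": 30,  "process": 7,  "disk": 7},
    "Bronze":   {"global": 7,   "process": 7,  "disk": 7},
}


def is_expired(service_level: str, data_type: str,
               collected_on: date, today: date) -> bool:
    """True if the data is older than its retention period.

    Assumption: data is kept while its age in days is <= the retention
    value and becomes eligible for purge once it is strictly older.
    """
    keep_days = VISUALIZER_RETENTION_DAYS[service_level][data_type]
    return (today - collected_on).days > keep_days


if __name__ == "__main__":
    today = date(2011, 11, 15)
    print(is_expired("Bronze", "global", date(2011, 11, 1), today))
    print(is_expired("Silver", "global", date(2011, 11, 1), today))
```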
Page 12
Current Data Collection Counts
Totals by Service Level
Service Level | UNIX Systems | Windows Systems | VMWare Systems | Total Systems | % of Total
Gold | 6,091 | 911 | 269 | 7,271 | 19%
Silver | 7,786 | 8,714 | ---- | 16,500 | 43%
Bronze | 8,413 | 2,022 | 4,147 | 14,582 | 38%
Total | 22,290 | 11,647 | 4,416 | 38,353 | 100%
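The row totals, column totals and percentages in this table can be re-derived from the per-platform counts. The short check below does that arithmetic; it treats the blank Silver VMWare cell as 0 and reads the Bronze UNIX count as 8,413 (the value the column total requires), both of which are assumptions about the slide.

```python
# Per-platform counts by service level, from the table above.
# Assumption: the "----" entry for Silver VMWare is counted as 0.
COUNTS = {
    "Gold":   {"UNIX": 6091, "Windows": 911,  "VMWare": 269},
    "Silver": {"UNIX": 7786, "Windows": 8714, "VMWare": 0},
    "Bronze": {"UNIX": 8413, "Windows": 2022, "VMWare": 4147},
}

# Total systems per service level (row totals).
level_totals = {level: sum(by_platform.values())
                for level, by_platform in COUNTS.items()}

# Total systems per platform (column totals).
platform_totals = {
    platform: sum(COUNTS[level][platform] for level in COUNTS)
    for platform in ("UNIX", "Windows", "VMWare")
}

grand_total = sum(level_totals.values())

# Share of all collected systems, rounded to a whole percent.
percent_of_total = {level: round(100 * total / grand_total)
                    for level, total in level_totals.items()}

if __name__ == "__main__":
    print(level_totals)
    print(platform_totals)
    print(grand_total)
    print(percent_of_total)
```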
Page 13
Data Collection Tools
The BMC Performance Assurance product family offers a complete solution for performance management of UNIX and Windows systems.
It delivers the following critical functions for managing distributed systems:
• Real-time monitoring
• Modeling and predicting
• Graphical performance analysis
Page 14
BMC Performance Assurance
[Architecture diagram]
• Servers (nodes): Collect
• Console Servers: Analyze, Analyst Console
• Database Server: Visualizer Database; CPDB with FACT Metric Tables (Hourly, Summarized)
• BMC Investigate (real-time analysis)
• BMC Predict (modeling)
• BMC Visualizer (detailed analysis)
• BMC Perceiver (web-based report viewing)
• ATT Developed: Application Reporter, Exception Reporting, Forecasting (bi-annual planning), CPDB Reporting
Page 15
BMC Consoles and Visualizer Database
Number of Servers Collected from Nightly Automation (Consoles A, B, C and D):
• 8,566 servers, 476 domains
• 8,743 servers, 475 domains
• 11,970 servers, 485 domains
• 9,074 servers, 489 domains
Visualizer database is 2.3 TB in size and divided into 92 schemas by:
• Platform
• Time Zone
• Service Level
(limit to 1,000 servers per schema for performance)
Visualizer Database Schemas
UNIX (62 UNIX Schemas)
Time Zone | Gold | Silver | Bronze
East | 6 | 8 | 6
Central | 6 | 8 | 7
Pacific | 5 | 7 | 4
All Other: 5
Windows (26 Windows Schemas)
Time Zone | Gold | Silver | Bronze
East | 1 | 5 | 2
Central | 1 | 6 | 2
Pacific | 1 | 3 | 1
All Other: 4
VMWare (4 VMWare Schemas)
Time Zone | Gold
East | 1
Central | 1
Pacific | 1
All Other: 1
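The schema counts on this slide add up to the stated totals (62 UNIX, 26 Windows, 4 VMWare, 92 overall). The tally below reproduces that arithmetic. The assignment of each Gold/Silver/Bronze group to East, Central and Pacific follows the order in which the groups appear on the slide and is an assumption; the sums do not depend on it.

```python
# Schema counts per platform and time zone as (Gold, Silver, Bronze).
# Assumption: groups are matched to East/Central/Pacific in slide order.
SCHEMAS = {
    "UNIX":    {"East": (6, 8, 6), "Central": (6, 8, 7), "Pacific": (5, 7, 4)},
    "Windows": {"East": (1, 5, 2), "Central": (1, 6, 2), "Pacific": (1, 3, 1)},
    "VMWare":  {"East": (1, 0, 0), "Central": (1, 0, 0), "Pacific": (1, 0, 0)},
}
# Schemas listed as "All Other" for each platform.
ALL_OTHER = {"UNIX": 5, "Windows": 4, "VMWare": 1}


def schemas_per_platform() -> dict:
    """Sum the per-time-zone schema counts plus the 'All Other' schemas."""
    totals = {}
    for platform, zones in SCHEMAS.items():
        zone_sum = sum(sum(counts) for counts in zones.values())
        totals[platform] = zone_sum + ALL_OTHER[platform]
    return totals


if __name__ == "__main__":
    totals = schemas_per_platform()
    print(totals)
    print(sum(totals.values()))
```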
Page 16
Data Collection Process
Perform binaries are laid down with the Patrol installation on the server (node)
A collector runs on each server (node) and writes data to disk periodically (currently set to 15 minutes)
The data is pulled by the Perform Console and processed nightly (hourly summarization) creating “vis” files
Nightly automation consists of 3 processes:
• Retrieve
• Analyze
• Populate
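The three nightly steps can be pictured as a small pipeline: pull the 15-minute samples from each node, summarize them to hourly values, and load the result into the database. The sketch below is a toy stand-in only; the function names, the in-memory data structures, and the use of a plain average for the hourly summarization are all assumptions and do not represent how the BMC Perform Console works internally.

```python
from statistics import mean

SAMPLES_PER_HOUR = 4  # collectors write every 15 minutes


def retrieve(node_samples: dict) -> dict:
    """Stand-in for pulling collector data from each node to the console."""
    return {node: list(samples) for node, samples in node_samples.items()}


def analyze(raw: dict) -> dict:
    """Summarize 15-minute samples into hourly values (plain average assumed)."""
    hourly = {}
    for node, samples in raw.items():
        hourly[node] = [
            mean(samples[i:i + SAMPLES_PER_HOUR])
            for i in range(0, len(samples), SAMPLES_PER_HOUR)
        ]
    return hourly


def populate(hourly: dict, database: dict) -> dict:
    """Stand-in for loading the summarized output into the database."""
    for node, values in hourly.items():
        database[node] = values
    return database


def nightly_automation(node_samples: dict, database: dict) -> dict:
    """Run the three steps in order: retrieve, analyze, populate."""
    return populate(analyze(retrieve(node_samples)), database)


if __name__ == "__main__":
    cpu_busy = {"node1": [10, 20, 30, 40, 50, 50, 50, 50]}
    print(nightly_automation(cpu_busy, {}))
```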
Page 17
Nightly Automation Scheduling
[Timeline chart, 21:00 to 11:00]
EST: retrieve 21:20, analyze 22:07, finish 1:15
CST: retrieve 22:20, analyze 2:12, finish 7:54
PST: retrieve 0:20, analyze 2:22, finish 6:55
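The elapsed time of each nightly run can be worked out from the retrieve and finish times on the timeline. The sketch below assumes all times are on one common clock, that a finish time earlier than the retrieve time falls on the following day, and that the three rows belong to EST, CST and PST in the order shown; those are assumptions drawn from the chart layout.

```python
from datetime import datetime, timedelta

# (retrieve start, finish) per time zone, as read from the timeline.
SCHEDULE = {
    "EST": ("21:20", "1:15"),
    "CST": ("22:20", "7:54"),
    "PST": ("0:20", "6:55"),
}


def elapsed(start: str, finish: str) -> timedelta:
    """Duration from start to finish, rolling over midnight when needed."""
    fmt = "%H:%M"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(finish, fmt)
    if t1 < t0:
        # Finish is on the next calendar day.
        t1 += timedelta(days=1)
    return t1 - t0


if __name__ == "__main__":
    for zone, (start, finish) in SCHEDULE.items():
        print(zone, elapsed(start, finish))
```

Under these assumptions the runs take 3 h 55 min (EST), 9 h 34 min (CST) and 6 h 35 min (PST).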
Page 18
Monitoring Environment Results
• Nightly automation stats:
  • 7 time zones
  • 39 states
  • 256 cities
  • 1,925 domains
  • 1,947 VIS files
  • 38,353 servers
  • 621,615 UDR files
• 13.5 new servers added per day over the last year (4,947)
Page 19
AT&T Capacity Planning Database
Page 20
AT&T Application Reporter
Page 21
Bonus Material
BMC 7.5
Page 22
Performance Assurance Release 7.5 New Features and Functionality
New Virtualization Support
• SUN Solaris Logical Domains (LDoms)
• SUN Chip Multi-Threading (CMT) technology
• IBM AIX Live Partition Mobility
• IBM AIX Workload Partitions (WPARs)
• IBM PowerVM
• HP Integrity Virtual Machines (IVM)
• Microsoft 2008 Virtualization Server (Hyper-V)
Enhanced VMware Virtualization Support
• Cluster, resource pool, disk and datastore metrics
• Info on relationships between servers, virtual machines, pools
• Perceiver support for cluster, resource pool and disk views
• Improvements to proxy data collector
• Complete re-design of Visualizer tables and relationships
Page 23
Performance Assurance Release 7.5 New Features and Functionality (cont)
Console Operations
• Improvements to Manager for recovery and reprocessing of data
• Manager exception reports
• Officially supported Service Levels
• New General Manager web application to manage Perform and Perceiver – daily operation and exceptions
• UDR Transfer Utility
• Changes to management of Hardware table for performance ratings
• Changes to the Visualizer database structures
• Problem resolutions and enhancement implementations
7.5 Migration Issues
Page 24
• Some Visualizer tables have been re-designed to accommodate metrics for virtual servers (current metrics may have moved to new tables).
• The changes in Visualizer require migrating all data from the old 7.4 schemas to the new 7.5 schemas.
• If multiple Consoles update the same Visualizer schema, all Consoles must be migrated to release 7.5 at the same time.
• The Visualizer database migrations must be done at the same time the Consoles are migrated to release 7.5.
• Therefore, in our environment, all Consoles and all Visualizer databases must be migrated to 7.5 simultaneously.
• Per BMC, very large Visualizer schemas may take longer than a day to migrate to 7.5 (we have 90+ Visualizer schemas).
• Per BMC, the most significant problems they have seen with the new release involve database migrations.
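The constraint that Consoles sharing a Visualizer schema must migrate together can be reasoned about as finding connected groups of Consoles. The sketch below uses a small union-find for that; the Console-to-schema mapping in the example is entirely hypothetical, and the function name `migration_groups` is an illustrative assumption.

```python
def migration_groups(console_schemas: dict) -> list:
    """Group consoles that must migrate together because they share a schema.

    console_schemas maps a console name to the set of schemas it updates.
    Consoles linked directly or indirectly by a shared schema end up in
    the same group.
    """
    parent = {console: console for console in console_schemas}

    def find(c):
        # Follow parent links to the group representative, compressing paths.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    first_console_for = {}
    for console, schemas in console_schemas.items():
        for schema in schemas:
            if schema in first_console_for:
                # Shared schema: merge this console's group with the earlier one.
                parent[find(console)] = find(first_console_for[schema])
            else:
                first_console_for[schema] = console

    groups = {}
    for console in console_schemas:
        groups.setdefault(find(console), set()).add(console)
    return sorted(sorted(members) for members in groups.values())


if __name__ == "__main__":
    # Hypothetical mapping, for illustration only.
    example = {
        "A": {"unix_east_gold_1", "unix_east_gold_2"},
        "B": {"unix_east_gold_2", "win_central_silver_1"},
        "C": {"win_central_silver_1"},
        "D": {"vmware_pacific_gold_1"},
    }
    print(migration_groups(example))
```

In this made-up example, Consoles A, B and C form one group that has to move to 7.5 together, while D could be migrated on its own. When every Console is linked through shared schemas, the result is a single group, which matches the conclusion on the slide.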