AT&T Proprietary (Internal Use Only) Not for use or disclosure outside the AT&T companies except under written agreement How to Manage Data Collection in a Large Environment Paul K Merline & Mike Badaczewski November 15, 2011




Which is greater: the average attendance at Busch Stadium or the number of servers we collect data on every night?


Answer:

• AT&T systems collected nightly = 38,353
• Busch Stadium average nightly attendance = 38,196 (source: ESPN.com, as of 10-18-11)


Data Collection Goals

Provide consistent, standard and meaningful resource usage data for all servers to support Capacity Planning.

Establish and maintain an environment capable of supporting data collection for 40,000 servers with the existing staff.

Have previous day’s data available by 08:00 local time.


Data Collection Overview

The number of metrics collected and their retention period are based on the criticality of the server (service levels).

Separate data collection based on platform, e.g. UNIX, Windows, etc.

Spread workload across several centralized data collection servers (Consoles).

Stagger data collection across time zones.

Analyzed data output is sent to the database server for Visualizer database loads.


Data Collection Strategy

Tier and Tier Level assigned based on:
• server criticality (MCA, normal production)
• status (production, test, development)
• in-service indicator

Service Level assigned based on Tier and Tier Level, which determines:
• metrics collected
• retention period of metrics

Collect and retain only the metrics necessary based on the criticality of the server



Tier Mapping to Service Levels

Tier | Tier Level | Tier Description | Service Level
1 | 9 | Systems requiring very complex analysis | Platinum
1 | 0 | VMWARE hosts & SUN global zones | Gold
1 | 1 | Production servers running company designated Mission Critical Applications | Gold
1 | 3 | Servers running Non Mission Critical application requiring Gold metrics per a Capacity Planner | Gold
2 | 1 | Server running a production application not specified Mission Critical | Silver
2 | 6 | Test & Development servers requiring Silver metrics per a Capacity Planner | Silver
2 | 2 | Server used for non-production test & QA | Bronze
2 | 4 | Server used for development work | Bronze
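The tier mapping above is effectively a lookup keyed on (Tier, Tier Level). A minimal Python sketch of that lookup follows; the dictionary contents come from the table, while the names SERVICE_LEVEL_BY_TIER and service_level are hypothetical and not part of any BMC or AT&T tooling.

```python
# (tier, tier_level) -> service level, copied from the Tier Mapping table.
# The constant and function names are illustrative assumptions.
SERVICE_LEVEL_BY_TIER = {
    (1, 9): "Platinum",
    (1, 0): "Gold",
    (1, 1): "Gold",
    (1, 3): "Gold",
    (2, 1): "Silver",
    (2, 6): "Silver",
    (2, 2): "Bronze",
    (2, 4): "Bronze",
}


def service_level(tier: int, tier_level: int) -> str:
    """Return the service level for a tier / tier level pair."""
    try:
        return SERVICE_LEVEL_BY_TIER[(tier, tier_level)]
    except KeyError:
        # Combinations not listed on the slide are rejected rather than guessed.
        raise ValueError(f"no service level defined for tier {tier}-{tier_level}") from None


print(service_level(1, 9))  # Platinum
print(service_level(2, 4))  # Bronze
```

Rejecting unknown combinations, instead of defaulting them to Bronze, keeps a mis-tiered server visible rather than silently under-collected.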


Data Collection Service Levels

Attribute | PLATINUM | GOLD | SILVER | BRONZE
Tier | Tier 1 | Tier 1 | Tier 2-1 | Tier 2-2 or 2-4
Servers | MCA requiring extra care | MCA or on frame w/ MCA | Non-MCA production | Test and development
Workload characterization | Custom | Custom | Standard | None
UNIX metrics | 35 | 35 | 21 | 10
Data summary | 15 min. | 60 min. | 60 min. | 60 min.
Data retention | 6 mos. | 6 mos. | 1 mo. | 7 days
Systems | 200 UNIX | 7,400 UNIX, 400 Win | 6,000 UNIX, 14,000 Win | 4,000 UNIX, 3,000 Win


Data Collection Process

Servers are grouped into collection domains based on:
• Service Level: Gold, Silver, Bronze
• Region: East, Central, Mountain, Pacific, Alaska, Hawaii
• Platform: UNIX, Windows, VMWare
• Frame: Frames, Non-Frames

(target is 25 servers per domain for performance reasons)
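The domain grouping can be sketched as a two-step process: bucket servers by service level, region, platform and frame status, then split each bucket into chunks of at most 25. The Python below is only an illustration under those assumptions; the Server record, the build_domains function and the sample host names are hypothetical, and the slides do not say how the real domains are assigned.

```python
from collections import defaultdict
from typing import NamedTuple

DOMAIN_TARGET = 25  # target servers per domain, from the slide


class Server(NamedTuple):
    name: str
    service_level: str  # Gold, Silver, Bronze
    region: str         # East, Central, Mountain, Pacific, Alaska, Hawaii
    platform: str       # UNIX, Windows, VMWare
    on_frame: bool      # Frames vs. Non-Frames


def build_domains(servers, target=DOMAIN_TARGET):
    """Group servers by the four attributes, then chunk each group."""
    buckets = defaultdict(list)
    for s in servers:
        buckets[(s.service_level, s.region, s.platform, s.on_frame)].append(s.name)
    domains = []
    for key in sorted(buckets):
        names = sorted(buckets[key])
        for i in range(0, len(names), target):
            domains.append((key, names[i:i + target]))
    return domains


# Hypothetical inventory: 60 Gold/East/UNIX servers and 3 Silver/Pacific/Windows servers.
inventory = [Server(f"ux{i:03d}", "Gold", "East", "UNIX", False) for i in range(60)]
inventory += [Server(f"win{i}", "Silver", "Pacific", "Windows", False) for i in range(3)]

for key, members in build_domains(inventory):
    print(key, len(members))
```

With this inventory the Gold/East/UNIX bucket splits into domains of 25, 25 and 10 servers, and the small Silver bucket becomes a single domain of 3.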


UNIX Metric Groups

UNIX BRONZE METRICS
• Collector Information
• Network Statistics
• Physical Partition Configuration
• CPU Configuration
• System Configuration
• PRM Configuration
• CPU Statistics
• System Statistics
• WLM Configuration
• File System Statistics

ADDITIONAL UNIX SILVER METRICS
• Disk Configuration
• NFS Statistics
• VMware ESX Host System Configuration
• Disk Statistics
• NFS3 Statistics
• VMware ESX Server Configuration
• Logical Volume Configuration
• Process Statistics
• VMware ESX CPU Statistics
• Network Configuration
• User Id Statistics
• VMware ESX Memory Statistics
• Network Interface Statistics
• Pool Configuration
• VMware ESX Disk Statistics
• Zone Configuration
• VMware ESX Network Statistics
• VMware ESX Per CPU Statistics

ADDITIONAL UNIX GOLD METRICS
• FSS Configuration
• RPC Client Statistics
• Scheduling Policy Info
• Swap Statistics
• RPC Server Statistics
• SRM Statistics
• Swap Device Statistics

The groupings are cumulative. The Silver service level will collect all the metric groups listed under the Bronze and Silver metric groups. The Gold service level will collect all the metrics listed on this page. Each metric group can generate many measurements and data fields. For example, the System Statistics metric group contains 189 data fields.
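The cumulative rule can be illustrated with three lists that are concatenated as the service level rises. This is a sketch only: the group names are taken from this page, the boundaries between names are read from the slide's column layout (an assumption), and metric_groups is a hypothetical helper.

```python
BRONZE = [
    "Collector Information", "Network Statistics", "Physical Partition Configuration",
    "CPU Configuration", "System Configuration", "PRM Configuration",
    "CPU Statistics", "System Statistics", "WLM Configuration",
    "File System Statistics",
]
SILVER_EXTRA = [
    "Disk Configuration", "NFS Statistics", "VMware ESX Host System Configuration",
    "Disk Statistics", "NFS3 Statistics", "VMware ESX Server Configuration",
    "Logical Volume Configuration", "Process Statistics", "VMware ESX CPU Statistics",
    "Network Configuration", "User Id Statistics", "VMware ESX Memory Statistics",
    "Network Interface Statistics", "Pool Configuration", "VMware ESX Disk Statistics",
    "Zone Configuration", "VMware ESX Network Statistics", "VMware ESX Per CPU Statistics",
]
GOLD_EXTRA = [
    "FSS Configuration", "RPC Client Statistics", "Scheduling Policy Info",
    "Swap Statistics", "RPC Server Statistics", "SRM Statistics",
    "Swap Device Statistics",
]

# Each level adds its own groups on top of the level below it.
LEVELS = [("Bronze", BRONZE), ("Silver", SILVER_EXTRA), ("Gold", GOLD_EXTRA)]


def metric_groups(service_level: str) -> list:
    """Return every metric group collected at the given service level."""
    collected = []
    for name, groups in LEVELS:
        collected.extend(groups)
        if name == service_level:
            return collected
    raise ValueError(f"unknown service level: {service_level}")


print(len(metric_groups("Bronze")))  # 10
print(len(metric_groups("Gold")))    # 35
```

The Bronze and Gold counts produced this way agree with the 10 and 35 UNIX metrics quoted for those service levels.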


Data Retentions

Visualizer Data Retention (data retention in days)

Service Level | Data Summary Interval (Min) | Global Data | Process Detail | Disk Detail
Platinum | 15 | 180 | 30 | 30
Gold | 60 | 180 | 30 | 30
Silver | 60 | 30 | 7 | 7
Bronze | 60 | 7 | 7 | 7

UDR Data Retention

Service Level | Collection Data Interval (Min) | Uncompressed (Days) | Compressed (Days)
Platinum | 15 | 2 | 30
Gold | 15 | 2 | 30
Silver | 15 | 2 | 14
Bronze | 60 | 2 | 0
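As a rough illustration of how the UDR retention figures could drive housekeeping, the sketch below decides what to do with a UDR file of a given age. Two points are assumptions, not statements from the slides: that the compressed period starts when the uncompressed period ends, and that a file is deleted once both periods have elapsed. The names UDR_RETENTION_DAYS and udr_action are hypothetical.

```python
# service level -> (uncompressed days, compressed days), from the UDR table.
UDR_RETENTION_DAYS = {
    "Platinum": (2, 30),
    "Gold": (2, 30),
    "Silver": (2, 14),
    "Bronze": (2, 0),
}


def udr_action(service_level: str, age_days: int) -> str:
    """Return 'keep', 'compress' or 'delete' for a UDR file of this age.

    Assumption: compressed retention is counted after the uncompressed
    period ends, so a Silver file lives 2 + 14 days in total.
    """
    uncompressed, compressed = UDR_RETENTION_DAYS[service_level]
    if age_days < uncompressed:
        return "keep"
    if age_days < uncompressed + compressed:
        return "compress"
    return "delete"


for level in ("Gold", "Silver", "Bronze"):
    print(level, [udr_action(level, age) for age in (1, 10, 20, 40)])
```

Under these assumptions Bronze files are removed as soon as the two uncompressed days are over, because that level keeps no compressed copies.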


Current Data Collection Counts

Totals by Service Level

Service Level | UNIX Systems | Windows Systems | VMWare Systems | Total Systems | % of Total
Gold | 6,091 | 911 | 269 | 7,271 | 19%
Silver | 7,786 | 8,714 | ---- | 16,500 | 43%
Bronze | 8,413 | 2,022 | 4,147 | 14,582 | 38%
Total | 22,290 | 11,647 | 4,416 | 38,353 | 100%
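The totals and percentages in this table can be cross-checked with a few lines of Python. The sketch takes the Bronze UNIX count as 8,413, the value that makes the row and column totals agree, and treats the Silver VMWare entry shown as "----" as zero; both readings are assumptions.

```python
COUNTS = {
    "Gold":   {"UNIX": 6091, "Windows": 911,  "VMWare": 269},
    "Silver": {"UNIX": 7786, "Windows": 8714, "VMWare": 0},     # "----" read as 0
    "Bronze": {"UNIX": 8413, "Windows": 2022, "VMWare": 4147},
}

# Row totals per service level and column totals per platform.
row_totals = {level: sum(by_platform.values()) for level, by_platform in COUNTS.items()}
column_totals = {
    platform: sum(COUNTS[level][platform] for level in COUNTS)
    for platform in ("UNIX", "Windows", "VMWare")
}
grand_total = sum(row_totals.values())

for level, total in row_totals.items():
    print(f"{level}: {total:,} ({round(100 * total / grand_total)}%)")
print(column_totals)
print(f"Total: {grand_total:,}")
```

The computed rows come out at 7,271 (19%), 16,500 (43%) and 14,582 (38%), for a grand total of 38,353.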


Data Collection Tools

The BMC Performance Assurance product family offers a complete solution for performance management of UNIX and Windows systems.

It delivers the following critical functions for managing distributed systems:

• Real-time monitoring
• Modeling and predicting
• Graphical performance analysis


BMC Performance Assurance

Diagram: BMC Performance Assurance components
• Servers (nodes): Collect
• Console Servers: Analyze
• Analyst Console
• Database Server: Visualizer Database
• BMC Investigate (real-time analysis)
• BMC Predict (modeling)
• BMC Visualizer (detailed analysis)
• BMC Perceiver (web-based report viewing)
• ATT Developed: CPDB (FACT Metric Tables: Hourly, Summarized), Application Reporter, Exception Reporting, Forecasting (bi-annual planning), CPDB Reporting


BMC Consoles and Visualizer Database

Console A, Console B, Console C, Console D

Number of Servers Collected from Nightly Automation
• 8,566 servers, 476 domains
• 8,743 servers, 475 domains
• 11,970 servers, 485 domains
• 9,074 servers, 489 domains

Visualizer Database Schemas

Visualizer database is 2.3 TB in size and divided into 92 schemas by:

• Platform

• Time Zone

• Service Level

(limit to 1,000 servers per schema for performance)

UNIX: 62 Schemas
Service Level | East | Central | Pacific
Gold | 6 | 6 | 5
Silver | 8 | 8 | 7
Bronze | 6 | 7 | 4
All Other - 5

Windows: 26 Schemas
Service Level | East | Central | Pacific
Gold | 1 | 1 | 1
Silver | 5 | 6 | 3
Bronze | 2 | 2 | 1
All Other - 4

VMWare: 4 Schemas
Service Level | East | Central | Pacific
Gold | 1 | 1 | 1
All Other - 1
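The per-platform schema counts can be checked against the 92-schema total by summing the figures on this slide. In the sketch below the pairing of each Gold/Silver/Bronze group with East, Central or Pacific follows the order in which the groups appear, which is an assumption; the sums do not depend on it.

```python
# platform -> time zone -> schema counts (Gold, Silver, Bronze where listed).
SCHEMAS = {
    "UNIX":    {"East": [6, 8, 6], "Central": [6, 8, 7], "Pacific": [5, 7, 4], "All Other": [5]},
    "Windows": {"East": [1, 5, 2], "Central": [1, 6, 2], "Pacific": [1, 3, 1], "All Other": [4]},
    "VMWare":  {"East": [1],       "Central": [1],       "Pacific": [1],       "All Other": [1]},
}

per_platform = {
    platform: sum(sum(counts) for counts in zones.values())
    for platform, zones in SCHEMAS.items()
}
total_schemas = sum(per_platform.values())

print(per_platform)   # {'UNIX': 62, 'Windows': 26, 'VMWare': 4}
print(total_schemas)  # 92
```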


Data Collection Process

Perform binaries are laid down with the Patrol installation on the server (node)

A collector runs on each server (node) and writes data to disk periodically (currently set to 15 minutes)

The data is pulled by the Perform Console and processed nightly (hourly summarization) creating “vis” files

Nightly automation consists of 3 processes: Retrieve, Analyze, Populate
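To make the hourly summarization step concrete, the sketch below rolls 15-minute samples up into one value per hour. It assumes a simple arithmetic mean per hour; the slides do not state which statistic the Perform Console actually uses, and summarize_hourly and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def summarize_hourly(samples):
    """Average (timestamp, value) samples into one value per clock hour."""
    by_hour = defaultdict(list)
    for timestamp, value in samples:
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        by_hour[hour_start].append(value)
    return {hour: mean(values) for hour, values in sorted(by_hour.items())}


# Hypothetical CPU utilization samples written every 15 minutes.
samples = [
    (datetime(2011, 11, 14, 10, 0), 20.0),
    (datetime(2011, 11, 14, 10, 15), 30.0),
    (datetime(2011, 11, 14, 10, 30), 40.0),
    (datetime(2011, 11, 14, 10, 45), 50.0),
    (datetime(2011, 11, 14, 11, 0), 60.0),
]

for hour, value in summarize_hourly(samples).items():
    print(hour.strftime("%H:%M"), value)
```

Four 15-minute samples collapse into a single hourly row, which is why a 60-minute summary needs roughly a quarter of the storage of a 15-minute one.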


Nightly Automation Scheduling

Timeline: 21:00 to 11:00

Time Zone | Retrieve | Analyze | Finish
EST | 21:20 | 22:07 | 1:15
CST | 22:20 | 2:12 | 7:54
PST | 0:20 | 2:22 | 6:55
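The elapsed time of each nightly run can be worked out from the retrieve and finish times, allowing for runs that cross midnight. The sketch pairs the three sets of times with EST, CST and PST in the order they appear on the slide and assumes every run finishes within 24 hours; elapsed is a hypothetical helper.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Time from start to end (HH:MM), rolling over midnight if needed."""
    fmt = "%H:%M"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        # The finish time is on the following day.
        t1 += timedelta(days=1)
    return t1 - t0


# (retrieve start, finish) per time zone, as read from the timeline.
RUNS = {"EST": ("21:20", "1:15"), "CST": ("22:20", "7:54"), "PST": ("0:20", "6:55")}

for zone, (retrieve, finish) in RUNS.items():
    print(zone, elapsed(retrieve, finish))
# EST 3:55:00
# CST 9:34:00
# PST 6:35:00
```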


Monitoring Environment Results

Nightly automation stats:
• 7 time zones
• 39 states
• 256 cities
• 1,925 domains
• 1,947 VIS files
• 38,353 servers
• 621,615 UDR files

• 13.5 new servers added per day over the last year (4,947)


AT&T Capacity Planning Database


AT&T Application Reporter


Bonus Material: BMC 7.5


Performance Assurance Release 7.5 New Features and Functionality

New Virtualization Support
• SUN Solaris Logical Domains (LDoms)
• SUN Chip Multi-Threading (CMT) technology
• IBM AIX Live Partition Mobility
• IBM AIX Workload Partitions (WPARs)
• IBM PowerVM
• HP Integrity Virtual Machines (IVM)
• Microsoft 2008 Virtualization Server (Hyper-V)

Enhanced VMware Virtualization Support
• Cluster, resource pool, disk and datastore metrics
• Info on relationships between servers, virtual machines, pools
• Perceiver support for cluster, resource pool and disk views
• Improvements to proxy data collector
• Complete re-design of Visualizer tables and relationships


Performance Assurance Release 7.5 New Features and Functionality (cont)

Console Operations
• Improvements to Manager for recovery and reprocessing of data
• Manager exception reports
• Officially supported Service Levels
• New General Manager web application to manage Perform and Perceiver (daily operation and exceptions)
• UDR Transfer Utility
• Changes to management of Hardware table for performance ratings
• Changes to the Visualizer database structures
• Problem resolutions and enhancement implementations


7.5 Migration Issues


• Some Visualizer tables have been re-designed to accommodate metrics for virtual servers (current metrics may have moved to new tables).

• The changes in Visualizer require migrating all data from the old 7.4 schemas to the new 7.5 schemas.

• If multiple Consoles update the same Visualizer schema, all Consoles must be migrated to release 7.5 at the same time.

• The Visualizer database migrations must be done at the same time the Consoles are migrated to release 7.5.

• Therefore, in our environment, all Consoles and all Visualizer databases must be migrated to 7.5 simultaneously.

• Per BMC, very large Visualizer schemas may take longer than a day to migrate to 7.5 (we have 90+ Visualizer schemas).

• Per BMC, the most significant problems they have seen with the new release involve database migrations.
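The rule that Consoles sharing a Visualizer schema must migrate together means the migration units are the connected groups of Consoles linked by shared schemas. The sketch below finds those groups with a small union-find; the console-to-schema assignments are invented for illustration and migration_groups is a hypothetical helper, not a BMC utility.

```python
def migration_groups(console_schemas):
    """Group consoles that are linked, directly or indirectly, by a shared schema."""
    parent = {console: console for console in console_schemas}

    def find(console):
        # Follow parent links to the group representative, compressing the path.
        while parent[console] != console:
            parent[console] = parent[parent[console]]
            console = parent[console]
        return console

    first_seen = {}  # schema -> first console found updating it
    for console, schemas in console_schemas.items():
        for schema in schemas:
            if schema in first_seen:
                parent[find(console)] = find(first_seen[schema])
            else:
                first_seen[schema] = console

    groups = {}
    for console in console_schemas:
        groups.setdefault(find(console), set()).add(console)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical layout: A and B share s2, B and C share s3, D is isolated.
layout = {
    "Console A": {"s1", "s2"},
    "Console B": {"s2", "s3"},
    "Console C": {"s3"},
    "Console D": {"s4"},
}
print(migration_groups(layout))
# [['Console A', 'Console B', 'Console C'], ['Console D']]
```

In an environment where the schemas are spread across all Consoles, this analysis yields a single group, which matches the conclusion above that everything has to move to 7.5 at once.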