[김유진] Data Science, Big Data, and Analytics of IBM

  • 2013 IBM Corporation

    Data Science, Big Data, and Analytics of IBM

    INDEX

    Part 1: About IBM, IBM Research & Use Case, Smarter Planet

    Part 2: Data, Data Science, Big Data, Analytics

    Part 1

    About IBM, IBM Research & Use Case, Smarter Planet

    IBM

    : 1967 4 25, IBM1401 IBM 100%

    : 1,135

    ( ) 2011 2010 2009

    12,061 12,250 12,068

    1,304.4 987.9 633.5

    Premier Partner/ISV 65()

    Advanced Partner 73

    Member Partner 1,179

    (Distributor) 9

    37

    2011.03 45 -

    2011.01 IT ' (ACO) AA

    2009.11 IBM

    2008.12 IT '

    2008.09

    2007.04 IBM 40

    2007.03 1

    2004 3000

    2003 IBM

    2002 IBM (SI)

    IBM


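    Business units:

    Global Business Services (GBS)

    Global Technology Services (GTS): IT

    Software Group (SWG): Database, Web Application, Groupware

    Systems and Technology Group (STG): Unix server, NT, POS, iSeries

    Global Process Services (GPS): CRM

    R&D: Trend

    IBM Global Financing (IGF)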


    Global Technology Services

    GTS SD (GTS Service Delivery)

    ITS (Information Technology Services)

    SO (Strategic Outsourcing) Sales

    SO (Strategic Outsourcing) Client Service

    MTS (Maintenance and Technical Support)

    ST&MA&CM (Strategy & Marketing, CM)

    Offering Group

    ITS Coverage

    ITS Delivery

    ITS Presale & Delivery

    ITS Sales

    Opportunity Owner

    Operation

    Growth Initiative

    Large Deal

    Global Business Services

    Consulting Service

    S&T (Strategy & Transformation)

    Sector

    FSS

    AMS & Delivery

    AMS

    Commercial

    Electronics

    I&G (Innovation & Growth)

    BAO (Business Analytics & Optimization)

    EA (Enterprise Applications)

    AIS (Application Innovation Service)

    Delivery Excellence

    Ops & Support


    IBM ,

    IBM , 6

    IBM 6

    IBM 10 , 12 IBM 16,000

    IBM GBS

    IBM GTS

    IBM GPS

    IBM STG

    IBM SWG

    IBM Financing

    IBM Research: 3,000 researchers in 12 labs

    Labs (year established): Zurich (1955), Watson (1961), Haifa (1972), Tokyo (1982), Almaden (1986), Austin (1995), China (1995), India (1998), Ireland (2010), Australia (2010), Brazil (2010), Africa (2012)

    4 labs participated in the Watson project


    Analytics enable better Decisions for Water System Management (Washington D.C. Water and Sewer Authority)

    Replacement

    What is the state of the water delivery and sewage disposal?

    What is the best way to allocate capital for infrastructure network upgrades?

    Failure Association

    How do environmental conditions impact failure?

    Does one brand of hydrant fail more frequently than another?

    How does aging process impact asset condition?

    PM Optimization

    Asset Failure & Risk

    Preventive Maintenance

    Can I reduce PM cost? Which failures are driving my water mains repair costs?

    Which pipes should I replace to prevent challenges next winter?

    Failure Prediction

    Which hydrants are most likely to fail in the next 6 months?

    What type of failure will most likely happen given the current condition?

    How likely is a pipe segment to fail?

    Application of these techniques in an engagement with Washington D.C. Water and Sewer Authority resulted in:

    25% increase in maintenance crew utilization

    30-50% cost savings on selected inspection and preventive maintenance

    Significant revenue increase through loss prevention and differential pricing


    Preventive Maintenance for Water System (Washington D.C. Water and Sewer Authority)

    Optimize the preventive maintenance (PM) time for each hydrant by considering the following factors:

    Inspection cost for PM before failure

    Repair cost given failure

    Penalty cost during downtime

    Failure risk

    Optimization model (annotations recovered from the slide formula): minimize the sum of inspection cost, repair cost, and penalty cost (incurred during repair downtime), subject to the periodic inspection interval not exceeding the maximum allowable periodic inspection interval (364 days).

    Maintenance planning result:

    PM time (days)    # of hydrants
    (100, 150]        1436
    (150, 200]        2153
    (200, 250]        2584
    > 250             1005
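The slide's objective (inspection, repair, and penalty costs, with the interval capped at 364 days) can be illustrated with a small sketch. This is not the model used in the engagement: the Weibull failure distribution, the age-replacement cost-rate formula, and all parameter names (`shape`, `scale`, `c_inspect`, `c_repair`, `c_penalty_per_day`, `downtime_days`) are assumptions made for illustration. Only the 364-day cap comes from the slide.

```python
import math

MAX_INTERVAL_DAYS = 364  # maximum allowable periodic inspection interval (from the slide)


def survival(t, shape, scale):
    """Weibull survival R(t) = exp(-(t / scale) ** shape); an assumed failure model."""
    return math.exp(-((t / scale) ** shape))


def cost_rate(interval, shape, scale, c_inspect, c_repair, c_penalty_per_day, downtime_days):
    """Expected cost per day if PM is performed every `interval` days (assumed model)."""
    r_end = survival(interval, shape, scale)
    failure_cost = c_repair + c_penalty_per_day * downtime_days
    expected_cost = c_inspect * r_end + failure_cost * (1.0 - r_end)
    # Expected cycle length E[min(X, interval)]: trapezoid rule on R(t) with 1-day steps.
    expected_length = sum(
        0.5 * (survival(t, shape, scale) + survival(t + 1, shape, scale))
        for t in range(interval)
    )
    return expected_cost / expected_length


def best_pm_interval(shape, scale, c_inspect, c_repair, c_penalty_per_day, downtime_days):
    """Grid search over 1..364 days for the interval with the lowest cost rate."""
    return min(
        range(1, MAX_INTERVAL_DAYS + 1),
        key=lambda t: cost_rate(t, shape, scale, c_inspect, c_repair,
                                c_penalty_per_day, downtime_days),
    )
```

With an aging hydrant (`shape` greater than 1) the search returns an interior interval; with a constant failure rate (`shape` equal to 1) early PM buys nothing under this model, so the search returns the 364-day cap.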

    Outage/Damage Prediction and Response Optimization (Utility Company)

    Workflow stages shown in the figure: Customized Weather Forecast, Damage Model, Outage Prediction, Response Plan, Data Assimilation, Revised Outage, Revised Response, Plan Execution, Report

    Output: Optimized Maintenance Plan

    Techniques: Prediction, Optimization, Real-time analytics


    Predicting Multi-Category Daily Damage Counts (Utility Company)

    Objective: Predict the daily multi-category damage counts at the region level based on the weather conditions

    Date range: 01/2010~02/2013

    Number of records: 52,206 for 34 regions

    Response Categories: 13 (C1~C13)

    Data characteristics:

    - Target: daily damage counts in multiple categories

    - Predictors:

    1. Cumulative rainfall in the preceding two weeks;

    2. Weather conditions aggregated for Day 0, Day -1, and Day -2

    Methods: Random Forests Model, Multivariate Poisson Regression Model

    Figure: timeline of predictors and target. Damages (C1, C2, C3) are counted on Day 0. Weather conditions are aggregated over 24-hour windows (starting at 12AM) for Day 0, Day -1, and Day -2: temperature (min, max), rain rate (max), daily rain (max), monthly rain (max), humidity (max), average wind speed (max), wind gust speed (max), wind gust frequency, and pressure (min, max). Cumulative rainfall is taken over a 14-day window.
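The predictor construction and a Poisson count model can be sketched as follows. This is a minimal illustration, not the study's pipeline: the function names, the choice of wind gust as the lagged variable, and the single-predictor form are assumptions, and the fit is an independent univariate Poisson regression for one damage category rather than the multivariate Poisson model named on the slide.

```python
import math


def build_features(daily_rain, daily_max_gust, day):
    """Predictors for `day`: rain over the preceding 14 days, then max gust on Day 0, -1, -2."""
    if day < 14:
        raise ValueError("need at least 14 days of history")
    cumulative_rain = sum(daily_rain[day - 14:day])
    lagged_gust = [daily_max_gust[day - lag] for lag in (0, 1, 2)]
    return [cumulative_rain] + lagged_gust


def fit_poisson(x, y, iterations=25):
    """Fit log(E[y]) = b0 + b1 * x by Newton's method on the Poisson log-likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2x2 information matrix.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

In use, `build_features` would be called once per region and day to assemble the design matrix, and one model would be fitted per damage category (C1 to C13).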


    Maintenance Scheduling (Semiconductor Manufacturing Plant)

    The scheduling problem for a wafer fab is a complex extension of the Resource Constrained Project Scheduling Problem that handles planned and unplanned orders.

    The objectives are to minimize the sum of the expected WIP in the time periods utilized by maintenance operations, minimize the number of technicians used, avoid performing maintenance early, and satisfy business rules.

    The scheduling problem needs to integrate the production schedule with the maintenance schedule so as to avoid maintenance during high demand for a machine.

    The system is currently deployed and generating schedules daily at IBM's East Fishkill 300mm semiconductor manufacturing plant.


    Anomaly Detection (Semiconductor Manufacturing) - Integrated Outlier Management in Tracer

    Information Theoretic Outlier Detection (Entropy Based)

    Comparison of the chamber of interest and the control band from the other chambers (Mean ± m*Std)

    Outlier detection for the chamber of interest (CUSUM Based Method)

    Objective: Exclude spurious values from score calculation.

    Method: an Integrated methodology consisting of Information Theoretic Method and Statistical Method; implemented in two steps.

    Step I: Calculation done in the context of the data from one chamber group, one recipe, one SVID, both time periods.

    Step II: Calculation done in the context of the data from one chamber group, one recipe, one SVID, single time period (reference/current).

    Figure: SVID trace over time for Chamber i, showing the control band (UCL, LCL), the CUSUM limit (UCL_CUSUM), and the detected outliers.
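The two statistical ingredients named on this slide, a control band built from the other chambers and a CUSUM check on the chamber of interest, can be sketched as below. This is a generic textbook version under stated assumptions, not the Tracer implementation: the function names, the pooling of the other chambers' readings, and the tabular CUSUM parameters `k` and `h` are all hypothetical choices.

```python
from statistics import mean, stdev


def control_band(other_chambers, m=3.0):
    """Return (LCL, UCL) = mean -/+ m * std over readings pooled from the other chambers."""
    pooled = [value for chamber in other_chambers for value in chamber]
    center, spread = mean(pooled), stdev(pooled)
    return center - m * spread, center + m * spread


def cusum_outliers(values, target, sigma, k=0.5, h=5.0):
    """Tabular CUSUM: flag indices where the upper or lower cumulative sum exceeds h * sigma."""
    upper = lower = 0.0
    flagged = []
    for index, value in enumerate(values):
        upper = max(0.0, upper + (value - target) - k * sigma)
        lower = max(0.0, lower + (target - value) - k * sigma)
        if upper > h * sigma or lower > h * sigma:
            flagged.append(index)
            upper = lower = 0.0  # restart accumulation after a signal
    return flagged
```

The band catches single readings that sit far from the other chambers, while the CUSUM accumulates small persistent shifts that never cross the band on their own.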

    Process Monitoring (Semiconductor Manufacturing) - Hotelling's T-squared Control Chart

    Objective: Design Hotelling's T-squared control charts for manufacturing tools.

    Method: a complete procedure consisting of Phase-I design (initial study) and Phase-II design (process monitoring)

    - Phase I: remove the outliers from the trace data collected from processes under normal conditions and calculate the in-control mean and covariance matrix;

    - Phase II: build the control chart using the in-control mean and covariance matrix from Phase I to monitor the current processes.

    Figure: Hotelling's T-squared Control Chart. x-axis: Wafer Label (0 to 250); y-axis: T-squared Value (0 to 20). The chart marks the UCL for Phase-I design, the UCL for Phase-II design, the Phase-I and Phase-II design regions, and the out-of-control points.
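The Phase I / Phase II procedure can be sketched for two trace variables, where the 2x2 covariance inverse has a closed form. This is an illustration under assumptions, not the deployed design: the function names are hypothetical, and both phases use the large-sample chi-square limit with 2 degrees of freedom (quantile `-2 * ln(alpha)`) in place of the exact Beta-based and F-based limits that the slide's two UCL types suggest.

```python
import math


def mean_and_cov(samples):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def t_squared(point, center, cov):
    """Hotelling's T-squared: d' S^-1 d, using the closed-form 2x2 inverse."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def chi2_ucl_2d(alpha=0.01):
    """Upper control limit from the chi-square(2) quantile (large-sample approximation)."""
    return -2.0 * math.log(alpha)


def phase_one(samples, alpha=0.01, max_rounds=10):
    """Phase I: drop points above the UCL, re-estimate, and repeat until none are dropped."""
    kept = list(samples)
    ucl = chi2_ucl_2d(alpha)
    for _ in range(max_rounds):
        center, cov = mean_and_cov(kept)
        inliers = [p for p in kept if t_squared(p, center, cov) <= ucl]
        if len(inliers) == len(kept):
            break
        kept = inliers
    return mean_and_cov(kept)


def phase_two(new_points, center, cov, alpha=0.01):
    """Phase II: flag new points whose T-squared exceeds the UCL."""
    ucl = chi2_ucl_2d(alpha)
    return [t_squared(p, center, cov) > ucl for p in new_points]
```

`phase_one` is run once on trace data collected under normal conditions; its output is then held fixed while `phase_two` scores each new wafer.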


    Process Monitoring and Quality Control (Semiconductor Manufacturing) - Motivation for virtual metrology applications

    Virtual metrology (VM) generally refers to a model-based prediction of some process outcome when there is no physical measurement of that outcome.

    Predictive modeling: The underlying models are learned from histories of the actual physical outcomes and process trace data

    Benefits: Detect faulty wafers early

    Improve process control: from lot-to-lot to wafer-to-wafer level

    Reduce physical measurements for process monitoring and control

    Tools publish large amounts of real-time data: throttle valve positions; electric bias, impedance, etc.; gas flows; temperature & pressure

    Can we use the data for process control?


    Process Monitoring and Quality Control (Semiconductor Manufacturing) - Performance of VM-enhanced process control

    Simulation results for a given set of parameters:

    VM-EWMA: reduces process variance by around 70%

    VM-LM: reduces process variance by around 30%

    Given a target process variance, e.g. 0.03, we can reduce the measurement frequency:

    VM-LM: from 1 out of 6 wafers to 1 out of 19 wafers

    VM-EWMA: from 1 out of 6 wafers to 1 out of 94 wafers

    Figure: variance of process outcomes (y-axis, 0 to 0.08) versus wafer index (x-axis, 0 to 150) for LM, VM-LM, and VM-EWMA.
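One way to picture VM-enhanced control is an EWMA run-to-run controller that updates from a physical measurement when one exists and from the VM prediction otherwise. The sketch below is a generic illustration under assumptions, not the VM-EWMA scheme simulated on the slide: the linear process form (outcome = intercept + gain x recipe), the callables `process` and `vm_model`, and every parameter name are hypothetical.

```python
def run_ewma_control(process, vm_model, target, gain, n_wafers, measure_every, lam=0.3):
    """Simulate EWMA run-to-run control; unmeasured wafers fall back on the VM prediction."""
    intercept_hat = 0.0  # running estimate of the process intercept
    outcomes = []
    for i in range(n_wafers):
        recipe = (target - intercept_hat) / gain   # control action for this wafer
        actual = process(i, recipe)                # true outcome, not always observed
        if i % measure_every == 0:
            observed = actual                      # physical metrology is available
        else:
            observed = vm_model(i, recipe)         # virtual metrology stands in
        # EWMA update of the intercept estimate from the observed (or predicted) outcome.
        intercept_hat = lam * (observed - gain * recipe) + (1.0 - lam) * intercept_hat
        outcomes.append(actual)
    return outcomes
```

Raising `measure_every` corresponds to measuring fewer wafers; how far it can be raised for a given target variance depends on how accurate `vm_model` is.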

    Power plant monitoring based on ANACONDA

    Business goal: Early anomaly detection to avoid emergency stops of the system

    Technical task: Detect anomalously behaving modules by comparing with the previous normally-working state

    # of sensors ~ 100

    Technical hurdle: Naive thresholding for individual sensors is hard since the system frequently changes its operational mode

    Result: Detected about 60% of the serious faults that cannot be detected with conventional methods

    ANACONDA captures the interdependency pattern between variables and detects a deviation from the normal pattern

    Example of detected faults (figure): air flow rate plotted against intake pressure, for the normal state and for the faulty state


    Unusual change in dependency

    IBM Anomaly Analyzer for Correlational Data (ANACONDA) leverages a unique dependency-based anomaly detection technology

    ANACONDA monitors the dependency among variables

    Setting a fixed threshold on individual variables leads to many false alerts for dynamic systems

    ANACONDA computes the anomaly score for individual variables

    Learns dependency patterns from past data under a normal condition

    Alert is raised if the present dependency is significantly different from the normal pattern
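The idea of scoring each variable by how much its dependencies have changed can be shown with a deliberately simple stand-in. This is not the ANACONDA algorithm: pairwise Pearson correlation as the dependency measure, the "largest change" scoring rule, and the function names are assumptions chosen to keep the sketch short.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def anomaly_scores(reference, current):
    """Per-variable score: largest change in correlation with any other variable."""
    n_vars = len(reference)
    scores = []
    for i in range(n_vars):
        changes = [
            abs(pearson(reference[i], reference[j]) - pearson(current[i], current[j]))
            for j in range(n_vars)
            if j != i
        ]
        scores.append(max(changes))
    return scores
```

Here `reference` holds one series per sensor from a normal period and `current` holds the same sensors over the window being checked. A sensor whose values stay within their usual range can still score high if its relationship to another sensor has flipped, which is the case a fixed per-variable threshold misses.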


    Dependency discovery is a key technology

    ANACONDA leverages a sparse structure learning technique for dependency discovery

    Automatically discovers important dependencies among sensors

    Dependency is identified by building sensor-wise predictive models

    Figure: a predictive model is built for each sensor in turn from the remaining sensors (first Sensor1 from Sensor2 to Sensor6, then Sensor2 from the others, and so on); this is repeated until convergence.
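A common way to build sparse sensor-wise predictive models is L1-penalized regression of each sensor on all the others, keeping only the sensors with nonzero coefficients as its dependencies. The sketch below assumes that approach for illustration; the slides do not state which sparse learner ANACONDA uses, and the function names, the `penalty` value, and the requirement that columns be mean-centered are assumptions.

```python
def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty`, returning 0.0 inside the dead zone."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_neighbors(columns, target, penalty, sweeps=50):
    """Regress sensor `target` on the other (centered) sensors by lasso coordinate descent.

    Returns {sensor_index: coefficient} for the sensors that keep a nonzero weight.
    """
    y = columns[target]
    others = [j for j in range(len(columns)) if j != target]
    coef = {j: 0.0 for j in others}
    n = len(y)
    for _ in range(sweeps):
        for j in others:
            xj = columns[j]
            # Residual of the target with sensor j's own contribution left out.
            partial = [
                y[i] - sum(coef[k] * columns[k][i] for k in others if k != j)
                for i in range(n)
            ]
            rho = sum(xj[i] * partial[i] for i in range(n)) / n
            norm = sum(v * v for v in xj) / n
            coef[j] = soft_threshold(rho, penalty) / norm
    return {j: c for j, c in coef.items() if c != 0.0}
```

Calling `lasso_neighbors` once per sensor yields a sparse dependency graph; a larger `penalty` prunes more edges.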

    VoC, FAQ

    INBOUND

    OUTBOUND

    IBM

    Healthcare advisor, Engagement advisor

    Q&A

    FAQ

    Smarter Planet

    http://www.ibm.com/smarterplanet/kr/ko/overview/ideas/index.html

    Part 2

    Data, Data Science, Big Data, Analytics

    Data = Digitalization of all things

    Table columns: Data Type, Form/Meaning, Transformed

    Data types: Text, Number, Sound/Signal, Image, Video

    Examples that survive in the Form/Meaning column: SNS, WEB, WEB LOG, DNA, CCTV, UCC

    Transformed forms: Number, Number + Text, Text, Feature

    Data Science = Handling of Digital Information

    Data Scientist of Korea = Group of Specialists

    IT System

    DB

    (R)

    System

    IT Architect, IT Outsourcing

    Data Scientist of Big Data


    Big Data

    Predictive analytics at the heart of the enterprise

    Figure (layered diagram). Labels recovered from the diagram:

    Lines of business: LOB 1, LOB 2, LOB 3

    Corporate Goals: Risk, Retain, Grow, Attract, Fraud

    Customer Interactions: Channels; Moments of Truth (I buy, I renew, I claim, I mend, I cancel)

    Business Processes and Optimized Business Processes: Customer Support, Claims Processing, Underwriting, Fraud Management, Sales Effectiveness, Marketing

    Analytical Foresight: Claims Profile, Fraud Risk, Customer LTV, Retention Risk, Best Offers, Customer Experience, Optimal Campaigns, Risk Assessment

    Platform: Data Mining & Statistics, Decision Optimization, Data Collection, Base Services, Visualization

    Data: Attitudinal Data, Interaction Data, Behavioral Data, Demographic Data, Customer Feedback

    The IBM Big Data Platform

    Figure (architecture diagram of IBM InfoSphere BigInsights and surrounding products). Labels recovered from the diagram:

    Layers and groups: Visualization & Discovery, Integration, Workload Optimization, Applications & Development, Advanced Analytic Engines, Runtime / Scheduler, Data Store, File System, Administration, Management, Security

    Components: Streams, Netezza, Flume, DB2, DataStage, MapReduce, HDFS, HBase, Text Processing Engine & Extractor Library, BigSheets, JDBC, Text Analytics, Pig, Jaql, Hive, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, ZooKeeper, Lucene, Oozie, Adaptive MapReduce, Integrated Installer, Admin Console, Sqoop, Adaptive Algorithms, Dashboard & Visualization, Apps, Workflow, Monitoring, HCatalog, Audit & History, Lineage, R, Guardium, Platform Computing, Cognos, Symphony, Symphony AE, GPFS FPO

    Legend: IBM, Open Source, Optional