Cloud Deep Learning Platform Architecture and Practice, 陈迪豪 (Chen Dihao) / 崔建伟 (Cui Jianwei)

Source: Huodongjia.com · 2017-07-28

  • About Us

    崔建伟 (Cui Jianwei), Deep Learning Platform Architect, Xiaomi

    陈迪豪 (Chen Dihao), Prophet Platform Architect, 4Paradigm

  • Agenda

    ❖ Define Cloud Machine Learning

    ❖ Re-define Cloud Machine Learning

    ❖ Cloud-ML at 4Paradigm

    ❖ Cloud-ML at Xiaomi

  • Define Cloud Machine Learning

    ! What is Machine Learning

    MLP, CNN, RNN/LSTM, RL

  • Define Cloud Machine Learning

    ! What is Cloud Machine Learning

    Products: Google Cloud Machine Learning Engine, Amazon Machine Learning, Azure Machine Learning Studio

    Platforms: Google Cloud, Amazon Web Services, Microsoft Azure Cloud

    [Diagram: Training and Prediction offerings on each platform; labels: TensorFlow, TensorFlow, EC2, SaaS, MXNet, Studio, SaaS, CNTK]

  • Define Cloud Machine Learning

    ! Why Cloud Machine Learning

    ! Training on a local machine

    ! No resource isolation

    ! No resource sharing

    ! No cluster orchestration

    ! No auto-scaling

    ! No automatic failover

    Example: pip install tensorflow

  • Define Cloud Machine Learning

    ! Architecture of Cloud Machine Learning

    Application Layer: Training / Prediction / …

    Machine Learning Layer: TensorFlow / MXNet / …

    Cloud Platform Layer: Kubernetes / OpenStack / …

  • Define Cloud Machine Learning

    ! Architecture of Cloud Machine Learning

    Model development, training jobs

    Online serving

  • Define Cloud Machine Learning

    ! Architecture of Google-like Cloud Machine Learning

    [Diagram components: Client, API Service (RESTful), K8S Cluster, TensorFlow, MXNet, TF Serving]

    Request types: Online Req, Offline Req

    Submit train job -> Create train container

    Submit prediction job -> Create predict container

    Create model service -> Create model container

  • Define Cloud Machine Learning

    ! Architecture of Google-like Cloud Machine Learning

    Step 1: Build docker image

    Step 2: Implement API service

    Step 3: Submit to Kubernetes
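
The three steps above can be sketched in a few lines. The following is a minimal illustration, not the speakers' implementation: it assumes the API service turns a "submit train job" request into a Kubernetes `batch/v1` Job manifest. The function name `build_train_job`, the label `cloud-ml-train`, the image URL, and the GPU resource name `nvidia.com/gpu` are all assumptions chosen for the example.

```python
import json


def build_train_job(name, image, command, gpus=0):
    """Return a Kubernetes batch/v1 Job manifest (as a dict) for one training container."""
    container = {
        "name": "trainer",
        "image": image,  # Step 1: the Docker image built beforehand
        "command": list(command),
    }
    if gpus > 0:
        # Assumption: GPUs are exposed under the NVIDIA device plugin's resource
        # name; a given cluster may use a different one.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": {"app": "cloud-ml-train"}},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run automatically
            "template": {
                "spec": {"containers": [container], "restartPolicy": "Never"}
            },
        },
    }


if __name__ == "__main__":
    # Step 2: an API service handler would build this from the request body.
    job = build_train_job(
        name="mnist-train-1",
        image="registry.example.com/cloud-ml/tensorflow:latest",  # hypothetical image
        command=["python", "train.py"],
        gpus=1,
    )
    # Step 3: the JSON manifest can be submitted to the cluster, for example
    # by piping it to `kubectl apply -f -`.
    print(json.dumps(job, indent=2))
```

Keeping the manifest as plain data makes the API service easy to test without a cluster; the submission call itself is left out because the deck does not say which client was used.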

  • Re-define Cloud Machine Learning

    ! TensorFlow vs Hadoop

    ! TensorFlow vs Spark

    ! TensorFlow vs Hive

    ! TensorFlow vs PowerGraph

    ! TensorFlow vs Azure ML Studio

    ! TensorFlow vs H2O / Dataiku / 数加

  • Re-define Cloud Machine Learning

    ! We need all of these!

    ! HDFS: for large data storage

    ! Hive: for data preprocessing

    ! Spark: for feature extraction

    ! Hadoop: for task scheduling

    ! TensorFlow: for model training

    ! Kubernetes: for CPU/GPU management

    “Super-machine-learning-man”

  • Re-define Cloud Machine Learning

    ! We want all of these!

    ! Closed loop from data preprocessing to online services

    ! Feature extraction without writing code

    ! Easy definition of the machine learning process

    ! Flexible and heterogeneous infrastructure

    ! Automatic failover and scaling

    ! Easy to use for domain experts

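  • Cloud-ML at 4Paradigm

    ! Prophet platform (先知平台)

    ! Simplified data import: supports RDBMS and HDFS data sources

    ! Simplified data splitting: supports splitting by ratio and by rule

    ! Simplified feature extraction: supports combinations of continuous and discrete features

    ! Simplified model training: supports an in-house ultra-high-dimensional LR and open-source framework algorithms

    ! Simplified model evaluation: supports ROC, Logloss, K-S, and other evaluation metrics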
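
For readers unfamiliar with the three evaluation metrics named above, here is a small reference sketch for binary classification. It uses the standard textbook definitions and only the Python standard library; it is not the Prophet platform's code, and the function names are chosen for this example.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def roc_auc(y_true, y_prob):
    """Area under the ROC curve, computed as the probability that a random
    positive is scored above a random negative (ties count as 0.5)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(y_true, y_prob):
    """Largest gap between the cumulative score distributions of the two classes."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    best = 0.0
    for t in sorted(set(y_prob)):
        cdf_pos = sum(p <= t for p in pos) / len(pos)
        cdf_neg = sum(n <= t for n in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(log_loss(labels, scores), roc_auc(labels, scores), ks_statistic(labels, scores))
```

These quadratic-time versions favor readability; a production evaluator would sort once and sweep the thresholds.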

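  • Cloud-ML at 4Paradigm

    ! Prophet platform (先知平台)

    News recommendation for the No. 1 news app in one country: click-through rate up 34%

    Audio recommendation for a top-3 app in the knowledge-sharing field: listen-through rate up 43%

    Streamer recommendation for a top-3 live-show streaming app: watch time up 21%

    Content recommendation for the largest UGC community in China: click-through rate up 93%

    [Figure: content users like vs. content users are indifferent to, compared across editors' expert-experience rules, machine learning model recommendation, and machine learning personalized recommendation]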

  • Cloud-ML at 4Paradigm

    http://prophet.4paradigm.com

  • Cloud-ML architecture

    Xiaomi Ecosystem Cloud (小米生态云)

    Xiaomi Fusion Cloud (小米融合云)

    Cloud-ML

    FDS (Xiaomi File Storage Service)

    Docker + Kubernetes

    PaaS: Dev, Training, Serving

    SaaS: Vision, NLP, ASR

  • Cloud-ML main features

  • Cloud-ML usage

    Cloud-ML

    Deployed on 4 clusters

    Used by 150+ developers

    20+ internal Xiaomi businesses onboarded

    5 Xiaomi ecosystem companies onboarded

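  • Cloud-ML in practice: PaaS improvements ! Dev environment

    ! Provides model development capability

    ! Implementation:

    ! Provides images of mainstream computing frameworks, with ssh support

    ! Runs the computing framework as a Kubernetes service

    ! Problems:

    ! Pods may be rescheduled

    ! Data persistence

    ! Exposing ports externally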

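  • Cloud-ML in practice: PaaS improvements ! Dev environment data persistence

    ! FUSE

    ! Supports the main POSIX interfaces

    ! FDS supports FUSE

    ! Supports mounting a user's Bucket locally

    ! Kubernetes supports FUSE

    ! Mount /dev/fuse when starting the Dev Pod

    ! Cloud-ML supports FUSE

    ! Mount the FDS Bucket when creating a Dev environment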
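
One way to read "mount /dev/fuse when starting the Dev Pod" is a Pod manifest that exposes the host's FUSE device to the container. The sketch below is an assumption-laden illustration, not Xiaomi's configuration: the function name, the label, the `FDS_BUCKET` and `FDS_MOUNT_PATH` environment variables, and the choice of privileged mode are all hypothetical. How the container's entrypoint actually mounts the FDS Bucket is not described in the slides and is left out.

```python
def build_dev_pod(name, image, bucket, mount_path="/fds"):
    """Return a Pod manifest (dict) whose container can see the host's /dev/fuse."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": "cloud-ml-dev"}},
        "spec": {
            "containers": [
                {
                    "name": "dev",
                    "image": image,
                    # Assumption: FUSE mounts inside a container need elevated
                    # privileges; privileged mode is the bluntest way to grant them.
                    "securityContext": {"privileged": True},
                    "env": [
                        # Hypothetical variables an entrypoint script could read
                        # to decide which bucket to mount and where.
                        {"name": "FDS_BUCKET", "value": bucket},
                        {"name": "FDS_MOUNT_PATH", "value": mount_path},
                    ],
                    "volumeMounts": [
                        {"name": "fuse-device", "mountPath": "/dev/fuse"}
                    ],
                }
            ],
            "volumes": [
                # hostPath volume that passes the node's FUSE device through.
                {"name": "fuse-device", "hostPath": {"path": "/dev/fuse"}}
            ],
        },
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_dev_pod("dev-alice", "cloud-ml/dev:latest", "alice-bucket"), indent=2))
```

Because the data lives in the bucket rather than in the container filesystem, a rescheduled Pod can remount the same bucket and continue where it left off.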

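  • Cloud-ML in practice: PaaS improvements ! Dev environment port exposure

    ! Requirement: open ports in the Dev environment that can be reached from outside

    ! Solution:

    ! HAProxy performs the forwarding

    ! Forwarding nodes are started as a service

    ! Firewall rules are configured for EIP:hostPort

    [Diagram labels: Public Access; HAProxy (forwarding node); Dev (compute node); Docker Proxy; Kube-proxy]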
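
The HAProxy forwarding described above could be driven by a generated configuration. The following sketch renders one `listen` section per exposed Dev port. The section names, addresses, ports, and timeout values are invented for illustration, and the slides do not say how the real configuration is produced or reloaded.

```python
def haproxy_listen_block(name, public_port, backend_host, backend_port):
    """Render one HAProxy `listen` section forwarding TCP traffic to a Dev environment."""
    return "\n".join(
        [
            f"listen {name}",
            f"    bind *:{public_port}",
            "    mode tcp",
            f"    server {name}-backend {backend_host}:{backend_port} check",
        ]
    )


def render_config(forwards):
    """Render a full config from (name, public_port, backend_host, backend_port) tuples."""
    defaults = "\n".join(
        [
            "defaults",
            "    mode tcp",
            "    timeout connect 5s",
            "    timeout client 1h",
            "    timeout server 1h",
        ]
    )
    blocks = [haproxy_listen_block(*forward) for forward in forwards]
    return "\n\n".join([defaults] + blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical mapping: public port 30022 on the forwarding node reaches
    # ssh (port 22) of a Dev environment at 10.0.0.5 on a compute node.
    print(render_config([("dev-alice-ssh", 30022, "10.0.0.5", 22)]))
```

TCP mode is used because Dev traffic such as ssh is not HTTP; the firewall rule then only has to allow the public port on the forwarding node.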

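  • Cloud-ML in practice: PaaS improvements ! Serving service discovery

    ! Current state:

    ! A NodePort is configured

    ! The control node forwards requests to the service

    ! Problems:

    ! The control node is a single point of failure

    ! Identifying services by port is unfriendly to business users

    ! Solution: name service

    [Diagram: a collector gathers service-to-pods mappings (service1: pods, service2: pods, …) from Kubernetes into the name service; the client asks the name service for a service, receives its pods, and sends requests to the pods]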
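
The name-service idea above can be illustrated with an in-memory registry. This is a toy sketch under stated assumptions: the class and method names are invented, the collector is reduced to a snapshot passed to `update`, and round-robin selection is an assumed client-side policy that the slides do not specify.

```python
import itertools


class NameService:
    """In-memory registry mapping a service name to its current pod endpoints."""

    def __init__(self):
        self._endpoints = {}
        self._cursors = {}

    def update(self, snapshot):
        """Replace the registry with a fresh {service: [endpoint, ...]} snapshot,
        as a collector watching Kubernetes might deliver it."""
        self._endpoints = {name: list(pods) for name, pods in snapshot.items()}
        self._cursors = {
            name: itertools.cycle(pods)
            for name, pods in self._endpoints.items()
            if pods
        }

    def lookup(self, service):
        """Return all known endpoints for a service (empty list if unknown)."""
        return list(self._endpoints.get(service, []))

    def pick(self, service):
        """Return the next endpoint for a service in round-robin order."""
        if service not in self._cursors:
            raise KeyError(f"no pods registered for service {service!r}")
        return next(self._cursors[service])


if __name__ == "__main__":
    ns = NameService()
    # Hypothetical snapshot produced by the collector.
    ns.update({"face-detect": ["10.0.0.1:8500", "10.0.0.2:8500"]})
    for _ in range(3):
        print(ns.pick("face-detect"))
```

Clients address a service by name and talk to pods directly, so no single control node sits on the request path and no port number has to be remembered.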

  • Cloud-ML in practice: SaaS services

    Image recognition

    Natural language processing

    Speech recognition

  • Cloud-ML in practice: SaaS services ! Usage scenarios

    Image / speech / text

    Smart devices, App, Server, Cloud-ML SaaS

    FDS (Xiaomi File Storage Service)

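  • Cloud-ML in practice: SaaS services ! Image recognition

    Face detection: face position, gender, age

    Object recognition: 1500+ object classes (including scenes such as living room and bedroom)

    FaceInfo: topX: 208 topY: 73 width: 403 height: 403 child female age: 5.6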

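  • Cloud-ML in practice: SaaS services ! Image recognition

    Face detection: face position, gender, age

    Object recognition: 1500+ object classes (including scenes such as living room and bedroom)

    Object          Confidence

    Living room     0.52

    Dining room     0.14

    Hall            0.08

    Waiting room    0.06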
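
A ranked list like the one above is typically produced by normalizing a classifier's raw scores and keeping the highest entries. The sketch below shows that post-processing step with a softmax followed by a top-k selection. It is an assumption about how such a table could be derived, not a description of the Cloud-ML service, and the input scores in the usage example are made up.

```python
import math


def top_k(logits, labels, k=4):
    """Softmax the logits and return the k (label, confidence) pairs with the
    highest confidence, best first."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Made-up raw scores for five classes.
    labels = ["living room", "dining room", "hall", "waiting room", "bedroom"]
    logits = [3.1, 1.8, 1.2, 0.9, 0.2]
    for label, confidence in top_k(logits, labels):
        print(f"{label}\t{confidence:.2f}")
```

With 1500+ classes the listed confidences need not sum to 1, since the remaining probability mass is spread over the classes that are not shown.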

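  • Cloud-ML in practice: SaaS services ! Image recognition

  • Cloud-ML in practice ! Future work

    ! PaaS

    ! Support more training frameworks

    ! Kaldi, CNTK

    ! Make Dev environment state saveable

    ! Resource overcommitment

    ! Seamless integration with the data processing pipeline

    ! SaaS

    ! Bring more model services online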