MapReduce Service User Guide. Issue 15. Date: 2020-03-18. Huawei Technologies Co., Ltd.

  • MapReduce Service

    User Guide

    Issue 15

    Date: 2020-03-18

    Huawei Technologies Co., Ltd.

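  • Copyright © Huawei Technologies Co., Ltd. 2020. All rights reserved.

    No part of this document may be excerpted, reproduced, or transmitted in any form, in whole or in part, by any organization or individual without the prior written consent of Huawei Technologies Co., Ltd.

    Trademark Notice

    The Huawei logo and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and registered trademarks mentioned in this document are the property of their respective holders.

    Notice

    The products, services, and features you purchase are subject to the commercial contract and terms of Huawei. All or part of the products, services, and features described in this document may not be within the scope of your purchase or use. Unless otherwise agreed in the contract, Huawei makes no representations or warranties of any kind, express or implied, regarding the content of this document.

    The content of this document is updated from time to time because of product version upgrades or other reasons. Unless otherwise agreed, this document serves only as a usage guide, and none of the statements, information, or recommendations in it constitutes a warranty of any kind, express or implied.

    Issue 15 (2020-03-18) Copyright © Huawei Technologies Co., Ltd.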

  • Contents

    1 IAM Permission Management
    1.1 Creating a User and Granting Permissions to Use MRS
    1.2 MRS Custom Policies
    1.3 Synchronizing IAM Users to MRS

    2 Getting Started
    2.1 How to Use MRS
    2.2 Creating a Cluster
    2.3 Uploading Sample Data and Programs
    2.4 Adding a Job
    2.5 Deleting a Cluster

    3 Configuring a Cluster
    3.1 Overview
    3.2 Introduction to the Cluster List
    3.3 Introduction to Purchase Methods
    3.4 Quickly Purchasing a Hadoop Analysis Cluster
    3.5 Quickly Purchasing an HBase Analysis Cluster
    3.6 Quickly Purchasing a Kafka Streaming Cluster
    3.7 Purchasing a Custom Cluster
    3.8 Creating a Cluster
    3.9 Creating a Minimum-Specification Cluster
    3.10 Configuring a Cluster with Decoupled Storage and Compute
    3.11 Adding Cluster Tags
    3.12 Installing Third-Party Software Using Bootstrap Actions
    3.12.1 Introduction to Bootstrap Actions
    3.12.2 Preparing the Bootstrap Action Script
    3.12.3 Viewing Execution Records
    3.12.4 Adding a Bootstrap Action
    3.12.5 Sample Scripts

    4 Managing Existing Clusters
    4.1 Viewing and Monitoring Clusters
    4.1.1 Viewing Basic Cluster Information
    4.1.2 Viewing Cluster Patch Information

    4.1.3 Viewing and Customizing Cluster Monitoring Metrics
    4.1.4 Managing Component and Host Monitoring
    4.2 Scaling Out a Cluster
    4.3 Scaling In a Cluster
    4.4 Configuring Auto Scaling Rules
    4.5 Configuring Auto Scaling Rules When Creating a Cluster
    4.6 Upgrading Master Node Specifications
    4.7 Configuring Message Notification
    4.8 O&M
    4.8.1 O&M Authorization
    4.8.2 Log Sharing
    4.9 Deleting a Cluster
    4.10 Unsubscribing from a Cluster
    4.11 Deleting Failed Tasks
    4.12 Job Management
    4.12.1 Introduction to MRS Jobs
    4.12.2 Running a MapReduce Job
    4.12.3 Running a Spark Job
    4.12.4 Running a HiveSql Job
    4.12.5 Running a SparkSql Job
    4.12.6 Running a Flink Job
    4.12.7 Running a Kafka Job
    4.12.8 Viewing Job Configurations and Logs
    4.12.9 Stopping a Job
    4.12.10 Copying a Job
    4.12.11 Deleting a Job
    4.12.12 Running a Job with OBS-Encrypted Data
    4.12.13 Configuring Job Message Notification
    4.13 Managing Data Files
    4.14 Component Management
    4.14.1 Introduction to Object Management
    4.14.2 Viewing Configurations
    4.14.3 Managing Service Operations
    4.14.4 Configuring Service Parameters
    4.14.5 Configuring Custom Service Parameters
    4.14.6 Synchronizing Service Configurations
    4.14.7 Managing Role Instance Operations
    4.14.8 Configuring Role Instance Parameters
    4.14.9 Synchronizing Role Instance Configurations
    4.14.10 Decommissioning and Recommissioning Role Instances
    4.14.11 Managing Host (Node) Operations
    4.14.12 Isolating a Host

    4.14.13 Canceling Host Isolation
    4.14.14 Starting and Stopping a Cluster
    4.14.15 Synchronizing Cluster Configurations
    4.14.16 Exporting Cluster Configuration Data
    4.14.17 Rolling Restart Support
    4.15 Alarm Management
    4.15.1 Viewing the Alarm List
    4.15.2 Viewing and Manually Clearing Alarms
    4.16 Alarm Reference
    4.16.1 ALM-12001 Audit Log Dump Failure
    4.16.2 ALM-12002 HA Resource Abnormal
    4.16.3 ALM-12004 OLdap Resource Abnormal
    4.16.4 ALM-12005 OKerberos Resource Abnormal
    4.16.5 ALM-12006 Node Fault
    4.16.6 ALM-12007 Process Fault
    4.16.7 ALM-12010 Heartbeat Interrupted Between the Active and Standby Manager Nodes
    4.16.8 ALM-12011 Data Synchronization Abnormal Between the Active and Standby Manager Nodes
    4.16.9 ALM-12012 NTP Service Abnormal
    4.16.10 ALM-12016 CPU Usage Exceeds the Threshold
    4.16.11 ALM-12017 Insufficient Disk Capacity
    4.16.12 ALM-12018 Memory Usage Exceeds the Threshold
    4.16.13 ALM-12027 Host PID Usage Exceeds the Threshold
    4.16.14 ALM-12028 Number of Processes in the D State on a Host Exceeds the Threshold
    4.16.15 ALM-12031 User omm or Its Password Is About to Expire
    4.16.16 ALM-12032 User ommdba or Its Password Is About to Expire
    4.16.17 ALM-12033 Slow Disk Fault
    4.16.18 ALM-12034 Periodic Backup Task Failure
    4.16.19 ALM-12035 Data Status Unknown After a Restoration Failure
    4.16.20 ALM-12037 NTP Server Abnormal
    4.16.21 ALM-12038 Monitoring Metric Dump Failure
    4.16.22 ALM-12039 GaussDB Active and Standby Data Not Synchronized
    4.16.23 ALM-12040 Insufficient System Entropy
    4.16.24 ALM-13000 ZooKeeper Service Unavailable
    4.16.25 ALM-13001 Insufficient Available ZooKeeper Connections
    4.16.26 ALM-13002 ZooKeeper Memory Usage Exceeds the Threshold
    4.16.27 ALM-14000 HDFS Service Unavailable
    4.16.28 ALM-14001 HDFS Disk Space Usage Exceeds the Threshold
    4.16.29 ALM-14002 DataNode Disk Space Usage Exceeds the Threshold
    4.16.30 ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold
    4.16.31 ALM-14004 Number of Corrupted HDFS Blocks Exceeds the Threshold
    4.16.32 ALM-14006 Number of HDFS Files Exceeds the Threshold
    4.16.33 ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold

    4.16.34 ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold
    4.16.35 ALM-14009 Number of Faulty DataNodes Exceeds the Threshold
    4.16.36 ALM-14010 NameService Service Abnormal
    4.16.37 ALM-14011 HDFS DataNode Data Directory Configured Improperly
    4.16.38 ALM-14012 HDFS Journalnode Data Not Synchronized
    4.16.39 ALM-16000 Percentage of Sessions Connected to HiveServer to the Maximum Number Allowed Exceeds the Threshold
    4.16.40 ALM-16001 Hive Data Warehouse Space Usage Exceeds the Threshold
    4.16.41 ALM-16002 Hive SQL Execution Success Rate Is Lower Than the Threshold
    4.16.42 ALM-16004 Hive Service Unavailable
    4.16.43 ALM-18000 Yarn Service Unavailable
    4.16.44 ALM-18002 NodeManager Heartbeat Lost
    4.16.45 ALM-18003 NodeManager Unhealthy
    4.16.46 ALM-18006 MapReduce Task Execution Timeout
    4.16.47 ALM-19000 HBase Service Unavailable
    4.16.48 ALM-19006 HBase Disaster Recovery Synchronization Failure
    4.16.49 ALM-25000 LdapServer Service Unavailable
    4.16.50 ALM-25004 LdapServer Data Synchronization Abnormal
    4.16.51 ALM-25500 KrbServer Service Unavailable
    4.16.52 ALM-27001 DBService Service Unavailable
    4.16.53 ALM-27003 Heartbeat Interrupted Between the Active and Standby DBService Nodes
    4.16.54 ALM-27004 DBService Active and Standby Data Not Synchronized
    4.16.55 ALM-28001 Spark Service Unavailable
    4.16.56 ALM-26051 Storm Service Unavailable
    4.16.57 ALM-26052 Number of Available Supervisors of the Storm Service Is Less Than the Threshold
    4.16.58 ALM-26053 Storm Slot Usage Exceeds the Threshold
    4.16.59 ALM-26054 Storm Nimbus Heap Memory Usage Exceeds the Threshold
    4.16.60 ALM-38000 Kafka Service Unavailable
    4.16.61 ALM-38001 Insufficient Kafka Disk Capacity
    4.16.62 ALM-38002 Kafka Heap Memory Usage Exceeds the Threshold
    4.16.63 ALM-24000 Flume Service Unavailable
    4.16.64 ALM-24001 Flume Agent Abnormal
    4.16.65 ALM-24003 Flume Client Connection Interrupted
    4.16.66 ALM-24004 Flume Data Read Abnormal
    4.16.67 ALM-24005 Flume Data Transmission Abnormal
    4.16.68 ALM-12041 Key File Permission Abnormal
    4.16.69 ALM-12042 Key File Configuration Abnormal
    4.16.70 ALM-23001 Loader Service Unavailable
    4.16.71 ALM-12357 Failed to Export Audit Logs to OBS
    4.16.72 ALM-12014 Device Partition Lost
    4.16.73 ALM-12015 Device Partition File System Read-Only
    4.16.74 ALM-12043 DNS Resolution Duration Exceeds the Threshold
    4.16.75 ALM-12045 Network Read Packet Loss Rate Exceeds the Threshold

    4.16.76 ALM-12046 Network Write Packet Loss Rate Exceeds the Threshold
    4.16.77 ALM-12047 Network Read Packet Error Rate Exceeds the Threshold
    4.16.78 ALM-12048 Network Write Packet Error Rate Exceeds the Threshold
    4.16.79 ALM-12049 Network Read Throughput Rate Exceeds the Threshold
    4.16.80 ALM-12050 Network Write Throughput Rate Exceeds the Threshold
    4.16.81 ALM-12051 Disk Inode Usage Exceeds the Threshold
    4.16.82 ALM-12052 TCP Temporary Port Usage Exceeds the Threshold
    4.16.83 ALM-12053 File Handle Usage Exceeds the Threshold
    4.16.84 ALM-12054 Certificate File Invalid
    4.16.85 ALM-12055 Certificate File About to Expire
    4.16.86 ALM-18008 Yarn ResourceManager Heap Memory Usage Exceeds the Threshold
    4.16.87 ALM-18009 MapReduce JobHistoryServer Heap Memory Usage Exceeds the Threshold
    4.16.88 ALM-20002 Hue Service Unavailable
    4.16.89 ALM-43001 Spark Service Unavailable
    4.16.90 ALM-43006 JobHistory Process Heap Memory Usage Exceeds the Threshold
    4.16.91 ALM-43007 JobHistory Process Non-Heap Memory Usage Exceeds the Threshold
    4.16.92 ALM-43008 JobHistory Process Direct Memory Usage Exceeds the Threshold
    4.16.93 ALM-43009 JobHistory GC Time Exceeds the Threshold
    4.16.94 ALM-43010 JDBCServer Process Heap Memory Usage Exceeds the Threshold
    4.16.95 ALM-43011 JDBCServer Process Non-Heap Memory Usage Exceeds the Threshold
    4.16.96 ALM-43012 JDBCServer Process Direct Memory Usage Exceeds the Threshold
    4.16.97 ALM-43013 JDBCServer GC Time Exceeds the Threshold
    4.16.98 ALM-44004 Number of Queued Tasks in a Presto Coordinator Resource Group Exceeds the Threshold
    4.16.99 ALM-44005 Presto Coordinator Process Garbage Collection Time Exceeds the Threshold
    4.16.100 ALM-44006 Presto Worker Process Garbage Collection Time Exceeds the Threshold
    4.16.101 ALM-18010 Number of Pending Yarn Tasks Exceeds the Threshold
    4.16.102 ALM-18011 Memory of Pending Yarn Tasks Exceeds the Threshold
    4.16.103 ALM-18012 Number of Yarn Tasks Terminated in the Previous Period Exceeds the Threshold
    4.16.104 ALM-18013 Number of Yarn Tasks That Failed in the Previous Period Exceeds the Threshold
    4.16.105 ALM-16005 Number of Failed Hive SQL Executions in the Previous Period Exceeds the Threshold
    4.17 Patch Management
    4.17.1 Patch Operation Guide for Versions After MRS 1.8.5
    4.17.2 Rolling Patches
    4.17.3 Repairing Patches on Isolated Hosts
    4.18 MRS Patch Description
    4.18.1 MRS 1.8.10.1 Patch Description
    4.18.2 MRS 2.0.1.1 Patch Description
    4.18.3 MRS 2.0.1.2 Patch Description
    4.18.4 MRS 2.0.1.3 Patch Description
    4.18.5 MRS 2.0.6.1 Patch Description
    4.18.6 MRS 2.1.0.1 Patch Description
    4.18.7 MRS 2.1.0.2 Patch Description

    4.18.8 MRS 2.1.0.3 Patch Description
    4.18.9 MRS 2.1.0.6 Patch Description
    4.18.10 MRS 2.1.0.7 Patch Description
    4.19 Log Management
    4.19.1 About Logs
    4.19.2 Manager Log List
    4.19.3 Viewing and Exporting Audit Logs
    4.19.4 Exporting Service Logs
    4.19.5 Configuring Audit Log Export Parameters
    4.20 Health Check Management
    4.20.1 Performing a Health Check
    4.20.2 Viewing and Exporting a Health Check Report
    4.20.3 DBService Health Check Indicators
    4.20.4 Flume Health Check Indicators
    4.20.5 HBase Health Check Indicators
    4.20.6 Host Health Check Indicators
    4.20.7 HDFS Health Check Indicators
    4.20.8 Hive Health Check Indicators
    4.20.9 Kafka Health Check Indicators
    4.20.10 KrbServer Health Check Indicators
    4.20.11 LdapServer Health Check Indicators
    4.20.12 Loader Health Check Indicators
    4.20.13 MapReduce Health Check Indicators
    4.20.14 OMS Health Check Indicators
    4.20.15 Spark Health Check Indicators
    4.20.16 Storm Health Check Indicators
    4.20.17 Yarn Health Check Indicators
    4.20.18 ZooKeeper Health Check Indicators
    4.21 Tenant Management
    4.21.1 Tenant Overview
    4.21.2 Adding a Tenant
    4.21.3 Adding a Sub-Tenant
    4.21.4 Deleting a Tenant
    4.21.5 Managing a Tenant Directory
    4.21.6 Restoring Tenant Data
    4.21.7 Adding a Resource Pool
    4.21.8 Modifying a Resource Pool
    4.21.9 Deleting a Resource Pool
    4.21.10 Configuring a Queue
    4.21.11 Configuring the Queue Capacity Policy of a Resource Pool
    4.21.12 Clearing Queue Configuration
    4.22 Backup and Restoration

    4.22.1 Introduction to Backup and Restoration
    4.22.2 Backing Up Metadata
    4.22.3 Restoring Metadata
    4.22.4 Modifying a Backup Task
    4.22.5 Viewing Backup and Restoration Tasks
    4.23 Security Management
    4.23.1 Default Users in Clusters with Kerberos Authentication Disabled
    4.23.2 Default Users in Clusters with Kerberos Authentication Enabled
    4.23.3 Changing the Password of an OS User
    4.23.4 Changing the admin Password
    4.23.5 Changing the Kerberos Administrator Password
    4.23.6 Changing the LDAP Administrator and LDAP User Passwords
    4.23.7 Changing the Password of a Component Running User
    4.23.8 Changing the OMS Database Administrator Password
    4.23.9 Changing the OMS Database Data Access User Password
    4.23.10 Changing the Password of a Component Database User
    4.23.11 Replacing the HA Certificate
    4.23.12 Updating the Cluster Key
    4.24 MRS Multi-User Permission Management
    4.24.1 Users and Permissions in MRS Clusters
    4.24.2 Default Users in Clusters with Kerberos Authentication Enabled
    4.24.3 Creating a Role
    4.24.4 Creating a User Group
    4.24.5 Creating a User
    4.24.6 Modifying User Information
    4.24.7 Locking a User
    4.24.8 Unlocking a User
    4.24.9 Deleting a User
    4.24.10 Changing the Password of an Operation User
    4.24.11 Initializing the Password of a System User
    4.24.12 Downloading a User Authentication File
    4.24.13 Modifying a Password Policy
    4.24.14 Configuring Cross-Cluster Mutual Trust
    4.24.15 Configuring and Using Users of a Mutually Trusted Cluster
    4.24.16 Configuring Fine-Grained OBS Access Permissions for MRS Multi-User Access

    5 Managing Historical Clusters
    5.1 Viewing Basic Information About a Historical Cluster

    6 Viewing Operation Logs

    7 Managing Data Connections

    8 Connecting to a Cluster
    8.1 Logging In to a Cluster

    8.1.1 Cluster Node Overview
    8.1.2 Logging In to a Cluster Node
    8.1.3 Determining the Active and Standby Management Nodes of MRS Manager
    8.2 Using an MRS Client
    8.2.1 Using an MRS Client on a Node Inside a Cluster
    8.2.2 Using an MRS Client on a Node Outside a Cluster
    8.2.3 Updating a Client
    8.3 Accessing Web Pages of Open-Source Components Hosted on an MRS Cluster
    8.3.1 Open-Source Component Web Sites
    8.3.2 Open-Source Component Port List
    8.3.3 Access Through an Elastic IP Address
    8.3.4 Access Through a Windows Elastic Cloud Server
    8.3.5 Creating an SSH Tunnel to an MRS Cluster and Configuring the Browser

    9 MRS Manager Operation Guide
    9.1 Introduction to MRS Manager
    9.2 Accessing MRS Manager
    9.3 Accessing a Manager That Supports Kerberos Authentication
    9.4 Viewing Running Tasks of a Cluster
    9.5 Monitoring Management
    9.5.1 System Overview
    9.5.2 Managing Service and Host Monitoring
    9.5.3 Managing Resource Distribution
    9.5.4 Configuring Monitoring Metric Dumping
    9.6 Alarm Management
    9.6.1 Viewing and Manually Clearing Alarms
    9.6.2 Configuring Monitoring and Alarm Thresholds
    9.6.3 Configuring Syslog Northbound Parameters
    9.6.4 Configuring SNMP Northbound Parameters
    9.7 Alarm Reference
    9.7.1 ALM-12001 Audit Log Dump Failure
    9.7.2 ALM-12002 HA Resource Abnormal
    9.7.3 ALM-12004 OLdap Resource Abnormal
    9.7.4 ALM-12005 OKerberos Resource Abnormal
    9.7.5 ALM-12006 Node Fault
    9.7.6 ALM-12007 Process Fault
    9.7.7 ALM-12010 Heartbeat Interrupted Between Active and Standby Manager Nodes
    9.7.8 ALM-12011 Data Synchronization Abnormal Between Active and Standby Manager Nodes
    9.7.9 ALM-12012 NTP Service Abnormal
    9.7.10 ALM-12016 CPU Usage Exceeds the Threshold
    9.7.11 ALM-12017 Insufficient Disk Capacity
    9.7.12 ALM-12018 Memory Usage Exceeds the Threshold
    9.7.13 ALM-12027 Host PID Usage Exceeds the Threshold

    9.7.14 ALM-12028 Number of Processes in the D State on a Host Exceeds the Threshold
    9.7.15 ALM-12031 User omm or Its Password Is About to Expire
    9.7.16 ALM-12032 User ommdba or Its Password Is About to Expire
    9.7.17 ALM-12033 Slow Disk Fault
    9.7.18 ALM-12034 Periodic Backup Task Failure
    9.7.19 ALM-12035 Data Status Unknown After a Restoration Failure
    9.7.20 ALM-12037 NTP Server Abnormal
    9.7.21 ALM-12038 Monitoring Metric Dump Failure
    9.7.22 ALM-12039 GaussDB Active and Standby Data Not Synchronized
    9.7.23 ALM-12040 Insufficient System Entropy
    9.7.24 ALM-13000 ZooKeeper Service Unavailable
    9.7.25 ALM-13001 Insufficient Available ZooKeeper Connections
    9.7.26 ALM-13002 ZooKeeper Memory Usage Exceeds the Threshold
    9.7.27 ALM-14000 HDFS Service Unavailable
    9.7.28 ALM-14001 HDFS Disk Space Usage Exceeds the Threshold
    9.7.29 ALM-14002 DataNode Disk Space Usage Exceeds the Threshold
    9.7.30 ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold
    9.7.31 ALM-14004 Number of Corrupted HDFS Blocks Exceeds the Threshold
    9.7.32 ALM-14006 Number of HDFS Files Exceeds the Threshold
    9.7.33 ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold
    9.7.34 ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold
    9.7.35 ALM-14009 Number of Faulty DataNodes Exceeds the Threshold
    9.7.36 ALM-14010 NameService Service Abnormal
    9.7.37 ALM-14011 HDFS DataNode Data Directory Improperly Configured
    9.7.38 ALM-14012 HDFS JournalNode Data Not Synchronized
    9.7.39 ALM-16000 Percentage of Sessions Connected to HiveServer Relative to the Maximum Allowed Exceeds the Threshold
    9.7.40 ALM-16001 Hive Data Warehouse Space Usage Exceeds the Threshold
    9.7.41 ALM-16002 Hive SQL Execution Success Rate Is Lower Than the Threshold
    9.7.42 ALM-16004 Hive Service Unavailable
    9.7.43 ALM-18000 Yarn Service Unavailable
    9.7.44 ALM-18002 NodeManager Heartbeat Lost
    9.7.45 ALM-18003 NodeManager Unhealthy
    9.7.46 ALM-18006 MapReduce Task Execution Timeout
    9.7.47 ALM-19000 HBase Service Unavailable
    9.7.48 ALM-19006 HBase Disaster Recovery Synchronization Failure
    9.7.49 ALM-25000 LdapServer Service Unavailable
    9.7.50 ALM-25004 LdapServer Data Synchronization Abnormal
    9.7.51 ALM-25500 KrbServer Service Unavailable
    9.7.52 ALM-27001 DBService Service Unavailable
    9.7.53 ALM-27003 Heartbeat Interrupted Between Active and Standby DBService Nodes
    9.7.54 ALM-27004 DBService Active and Standby Data Not Synchronized
    9.7.55 ALM-28001 Spark Service Unavailable

    9.7.56 ALM-26051 Storm Service Unavailable
    9.7.57 ALM-26052 Number of Available Supervisors of the Storm Service Is Less Than the Threshold
    9.7.58 ALM-26053 Storm Slot Usage Exceeds the Threshold
    9.7.59 ALM-26054 Storm Nimbus Heap Memory Usage Exceeds the Threshold
    9.7.60 ALM-38000 Kafka Service Unavailable
    9.7.61 ALM-38001 Insufficient Kafka Disk Capacity
    9.7.62 ALM-38002 Kafka Heap Memory Usage Exceeds the Threshold
    9.7.63 ALM-24000 Flume Service Unavailable
    9.7.64 ALM-24001 Flume Agent Abnormal
    9.7.65 ALM-24003 Flume Client Connection Interrupted
    9.7.66 ALM-24004 Flume Data Read Abnormal
    9.7.67 ALM-24005 Flume Data Transmission Abnormal
    9.7.68 ALM-12041 Key File Permission Abnormal
    9.7.69 ALM-12042 Key File Configuration Abnormal
    9.7.70 ALM-23001 Loader Service Unavailable
    9.7.71 ALM-12357 Failed to Export Audit Logs to OBS
    9.7.72 ALM-12014 Device Partition Lost
    9.7.73 ALM-12015 Device Partition File System Read-Only
    9.7.74 ALM-12043 DNS Resolution Duration Exceeds the Threshold
    9.7.75 ALM-12045 Network Read Packet Loss Rate Exceeds the Threshold
    9.7.76 ALM-12046 Network Write Packet Loss Rate Exceeds the Threshold
    9.7.77 ALM-12047 Network Read Packet Error Rate Exceeds the Threshold
    9.7.78 ALM-12048 Network Write Packet Error Rate Exceeds the Threshold
    9.7.79 ALM-12049 Network Read Throughput Rate Exceeds the Threshold
    9.7.80 ALM-12050 Network Write Throughput Rate Exceeds the Threshold
    9.7.81 ALM-12051 Disk Inode Usage Exceeds the Threshold
    9.7.82 ALM-12052 TCP Temporary Port Usage Exceeds the Threshold
    9.7.83 ALM-12053 File Handle Usage Exceeds the Threshold
    9.7.84 ALM-12054 Certificate File Invalid
    9.7.85 ALM-12055 Certificate File About to Expire
    9.7.86 ALM-18008 Yarn ResourceManager Heap Memory Usage Exceeds the Threshold
    9.7.87 ALM-18009 MapReduce JobHistoryServer Heap Memory Usage Exceeds the Threshold
    9.7.88 ALM-20002 Hue Service Unavailable
    9.7.89 ALM-43001 Spark Service Unavailable
    9.7.90 ALM-43006 JobHistory Process Heap Memory Usage Exceeds the Threshold
    9.7.91 ALM-43007 JobHistory Process Non-Heap Memory Usage Exceeds the Threshold
    9.7.92 ALM-43008 JobHistory Process Direct Memory Usage Exceeds the Threshold
    9.7.93 ALM-43009 JobHistory GC Time Exceeds the Threshold
    9.7.94 ALM-43010 JDBCServer Process Heap Memory Usage Exceeds the Threshold
    9.7.95 ALM-43011 JDBCServer Process Non-Heap Memory Usage Exceeds the Threshold
    9.7.96 ALM-43012 JDBCServer Process Direct Memory Usage Exceeds the Threshold
    9.7.97 ALM-43013 JDBCServer GC Time Exceeds the Threshold


  • 9.7.98 ALM-44004 Number of Queued Tasks in a Presto Coordinator Resource Group Exceeds the Threshold
    9.7.99 ALM-44005 Presto Coordinator Process Garbage Collection Time Exceeds the Threshold
    9.7.100 ALM-44006 Presto Worker Process Garbage Collection Time Exceeds the Threshold
    9.7.101 ALM-18010 Number of Pending Yarn Tasks Exceeds the Threshold
    9.7.102 ALM-18011 Memory of Pending Yarn Tasks Exceeds the Threshold
    9.7.103 ALM-18012 Number of Yarn Tasks Terminated in the Previous Period Exceeds the Threshold
    9.7.104 ALM-18013 Number of Yarn Tasks That Failed in the Previous Period Exceeds the Threshold
    9.7.105 ALM-16005 Number of Failed Hive SQL Executions in the Previous Period Exceeds the Threshold
    9.8 Object Management
    9.8.1 Introduction to Object Management
    9.8.2 Viewing Configurations
    9.8.3 Managing Service Operations
    9.8.4 Configuring Service Parameters
    9.8.5 Configuring Customized Service Parameters
    9.8.6 Synchronizing Service Configurations
    9.8.7 Managing Role Instance Operations
    9.8.8 Configuring Role Instance Parameters
    9.8.9 Synchronizing Role Instance Configurations
    9.8.10 Decommissioning and Recommissioning Role Instances
    9.8.11 Managing Host Operations
    9.8.12 Isolating a Host
    9.8.13 Canceling Host Isolation
    9.8.14 Starting and Stopping a Cluster
    9.8.15 Synchronizing Cluster Configurations
    9.8.16 Exporting Cluster Configuration Data
    9.9 Log Management
    9.9.1 Viewing and Exporting Audit Logs
    9.9.2 Exporting Service Logs
    9.9.3 Configuring Audit Log Export Parameters
    9.10 Health Check Management
    9.10.1 Performing a Health Check
    9.10.2 Viewing and Exporting Health Check Reports
    9.10.3 Configuring the Number of Health Check Reports to Retain
    9.10.4 Managing Health Check Reports
    9.10.5 DBService Health Check Indicators
    9.10.6 Flume Health Check Indicators
    9.10.7 HBase Health Check Indicators
    9.10.8 Host Health Check Indicators
    9.10.9 HDFS Health Check Indicators
    9.10.10 Hive Health Check Indicators
    9.10.11 Kafka Health Check Indicators
    9.10.12 KrbServer Health Check Indicators


  • 9.10.13 LdapServer Health Check Indicators
    9.10.14 Loader Health Check Indicators
    9.10.15 MapReduce Health Check Indicators
    9.10.16 OMS Health Check Indicators
    9.10.17 Spark Health Check Indicators
    9.10.18 Storm Health Check Indicators
    9.10.19 Yarn Health Check Indicators
    9.10.20 ZooKeeper Health Check Indicators
    9.11 Static Service Pool Management
    9.11.1 Viewing the Status of a Static Service Pool
    9.11.2 Configuring a Static Service Pool
    9.12 Tenant Management
    9.12.1 Introduction to Tenants
    9.12.2 Adding a Tenant
    9.12.3 Adding a Sub-Tenant
    9.12.4 Deleting a Tenant
    9.12.5 Managing a Tenant Directory
    9.12.6 Restoring Tenant Data
    9.12.7 Adding a Resource Pool
    9.12.8 Modifying a Resource Pool
    9.12.9 Deleting a Resource Pool
    9.12.10 Configuring a Queue
    9.12.11 Configuring the Queue Capacity Policy of a Resource Pool
    9.12.12 Clearing Queue Configurations
    9.13 Backup and Restoration
    9.13.1 Introduction to Backup and Restoration
    9.13.2 Backing Up Metadata
    9.13.3 Restoring Metadata
    9.13.4 Modifying a Backup Task
    9.13.5 Viewing Backup and Restoration Tasks
    9.14 Security Management
    9.14.1 Default Users of Clusters with Kerberos Authentication Disabled
    9.14.2 Changing the Password of an OS User
    9.14.3 Changing the Password of User admin
    9.14.4 Changing the Password of the Kerberos Administrator
    9.14.5 Changing the Passwords of the LDAP Administrator and the LDAP User
    9.14.6 Changing the Password of a Component Running User
    9.14.7 Changing the Password of the OMS Database Administrator
    9.14.8 Changing the Password of the OMS Database Data Access User
    9.14.9 Changing the Password of a Component Database User
    9.14.10 Replacing the HA Certificate
    9.14.11 更新集