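Advances in Natural Language Processing
Dr. Ming Zhou, Assistant Managing Director, Microsoft Research Asia; President-Elect, Association for Computational Linguistics
December 2018, Chongqing University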
5 Research Focus Areas
Microsoft Research Asia: Overview
Microsoft's fundamental research arm in the Asia-Pacific region
Technologies transferred into all major Microsoft products
Founded on Nov. 5, 1998
Papers published: 4,000+
Best paper awards: 50+
"The World's Hottest Computer Lab" -- MIT Technology Review
Main research areas at Microsoft Research Asia
Natural user interfaces, multimedia, machine learning, big data
Self-driving
Personal assistant
Surveillance detection Translation Medical diagnostics Game
Art
Image recognition Speech recognition Natural language Generative model Reinforcement learning
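Advances in artificial intelligence
Large datasets: 14M images
Computing power: RDMA
Algorithms and frameworks
Data, algorithms, and computing power are the three pillars of AI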
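Microsoft AI reaches human-level performance on standard benchmarks
Object recognition (ImageNet): 96%
Speech recognition (Switchboard): 94.9%
Reading comprehension (SQuAD): 88.5%
Machine translation (WMT-2017): 69%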
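Creative intelligence
Cognitive intelligence: language, knowledge, reasoning
Perceptual intelligence: hearing, vision, touch
Computational intelligence: memory, computation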
Automatic video captioning
A car is running
A man is cutting a piece of meat
A man is performing on a stage
A man is riding a bike
A man is singing
A panda is walking
A woman is riding a horse
A man is flying in a field
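Vision, Language, Search, Knowledge, Speech
Microsoft Cognitive Services: intelligent APIs for understanding the world
"Using computers to process and understand human language, so that they can listen, speak, read, and write as people do, is one of the most critical core technologies of the future." -- Bill Gates
"Natural language processing is the jewel in the crown of artificial intelligence." -- Bill Gates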
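Why language understanding is hard
剩女和剩男产生的原因有两个:一是谁都看不上,二是谁都看不上。
("There are two reasons people stay single: first, they don't fancy anyone; second, no one fancies them." The two clauses are written identically in Chinese, because 谁都看不上 can carry either meaning.)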
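Language intelligence (natural language understanding)
NLP fundamentals
Lexical representation and analysis
Phrase representation and analysis
Syntactic and semantic representation and analysis
Discourse representation and analysis
Information retrieval
NLP core technologies
Machine translation
Question answering
Chat and dialogue
Knowledge engineering
Language generation
NLP+
Search engines
Intelligent customer service
Business intelligence
Voice assistants
Machine learning, big data, user profiling
Recommender systems, information extraction
Domain knowledge, cloud computing
Important and wide-ranging applications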
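A brief history of NLP
• 1940 ~ 1954: invention of the electronic computer; early theories of intelligence
  • Key figures: Chomsky, Backus, Weaver, Shannon
• 1954 ~ 1970: formal rule systems, logic theory, the perceptron
  • Key figures: Minsky, Rosenblatt
• 1970 ~ 1980: HMM-based speech recognition; semantic and discourse modeling
  • Key figures: Frederick Jelinek, Martin Kay
• 1980 ~ 1991: construction of large-scale rule and knowledge bases
  • Representative systems: WordNet (1985), HPSG (1987), CYC (1984)
• 1991 ~ 2008: widespread use of statistical modeling and machine learning
  • Representative methods: SVM, MaxEnt, PCFG, PageRank
  • Typical applications: statistical machine translation, the IBM Watson question answering system, web search
• 2008 ~ 2017: big data and deep learning
  • Representative technologies: word embeddings, neural machine translation, machine reading, dialogue systems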
Source : The Economist
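[Figure: dependency parse of a question about the name of Justin Bieber's brother. Words: what, is, the, name, of, Justin, Bieber, brother. POS tags: WP, VBZ, DT, NN, IN, NNP. Dependency labels: root, nsubj, attr, det, prep, pobj, nn.]
Syntactic parsing
Semantic parsing
Context-independent single-turn parsing
Context-dependent multi-turn parsing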
Handspring's other board members are Dubinsky and chief product officer Jeff Hawkins, both Handspring co-founders; John Doerr, general partner at Kleiner, Perkins, Caufield & Byers; Bruce Dunlevie, managing member with Benchmark Capital; Mitchell Kertzman, CEO of Liberate Inc.; and Kim Clark, dean of Harvard Business School.
IE output (Person-Affiliation)
NAME            TITLE           ORGANIZATION
Dubinsky        board member    Handspring
Jeff Hawkins    board member    Handspring
John Doerr      board member    Handspring
Kim Clark       board member    Handspring
Dubinsky        co-founder      Handspring
Kim Clark       dean            Harvard…
…….
Information extraction
Adam Wang (Male)
XXXX Company of Beijing, Beijing City, 100007
1364-110-XXX
[email protected]
Education Background
From Sept. 2000 to Apr. 2003, I got master degree from University of XXX in computer software engineering major.
From Sept. 1996 to July. 2000, I got bachelor degree from School of XXX of Xi’an in computer science and technology major.
Experience
From March 2003 to now, Software Engineer, XXXX Company of Beijing
From June 2001 to March 2003, Software Engineer, Research Center of XXX Company
From Sept. 2000 to May 2001, Software Engineer, National Lab. Of XXX University
Interests
Reading, music, and jogging
<Name>Adam Wang</Name>
<Gender>Male</Gender>
<Address>XXXX Company of Beijing, Beijing City</Address>
<ZipCode>100007</ZipCode>
<Mobile>1364-110-XXX</Mobile>
<Email>[email protected]</Email>
<GradSchool>University of XXX</GradSchool>
<Major>Computer Software Engineering</Major>
<Degree>Master</Degree>
<GradSchool>School of XXX of Xi’an</GradSchool>
<Major>Computer Science and Technology</Major>
<Degree>Bachelor</Degree>
<Interests>Reading, music, and jogging</Interests>
<Experience>From March 2003 to now, Software Engineer, XXXX Company of Beijing From June 2001 to March 2003, Software Engineer, Research Center of XXX Company From Sept. 2000 to May 2001, Software Engineer, National Lab. Of XXX University</Experience>
Personal Information / Personal detailed info
Education / Educational detailed info
Research Experience
Interests
Example: information extraction from a résumé
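Knowledge graph
• Entity: a node in the knowledge graph
• Predicate: an edge connecting two entities
• CVT (Compound Value Type): not a real entity node; it is used to collect the multiple attributes of one event
• A fact is a triple made of a predicate and the two entities it connects. An event links a group of entities through a CVT node.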
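The definitions above can be made concrete with a small sketch. The Python below stores facts as (subject, predicate, object) triples and models an event with a CVT node. Every name in it (`person_1`, `city_1`, `education_cvt_1`, `education`, `institution`, `degree`) is a hypothetical placeholder chosen for illustration, not an identifier from any real knowledge base.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy triple store: entities are nodes, predicates are labeled edges."""

    def __init__(self):
        # subject -> list of (predicate, object) edges
        self._out = defaultdict(list)

    def add(self, subject, predicate, obj):
        self._out[subject].append((predicate, obj))

    def edges(self, subject):
        return list(self._out.get(subject, []))

    def objects(self, subject, predicate):
        return [o for p, o in self.edges(subject) if p == predicate]


def event_attributes(graph, subject, event_predicate):
    """Follow subject -> CVT node, then collect each CVT's attributes."""
    return [dict(graph.edges(cvt)) for cvt in graph.objects(subject, event_predicate)]


kg = KnowledgeGraph()
# A plain fact: one predicate connecting two entities.
kg.add("person_1", "placeOfBirth", "city_1")
# An event: the CVT node gathers several attributes of one education event.
kg.add("person_1", "education", "education_cvt_1")
kg.add("education_cvt_1", "institution", "university_1")
kg.add("education_cvt_1", "degree", "master")
```

Keeping the CVT as an ordinary node means an event with any number of attributes still fits the same triple format as a simple fact.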
Question answering systems
Simple-Relation Question
Multi-Constraint Question
Multi-Hop Question
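Question answering systems
[Figure: QA system overview, from Query to Response.]
Sources: Knowledge Graph, Web Table, Web Document, Image/Video
QA components: KBQA, TableQA, DocQA, CommunityQA, VisionQA, Question Generation
Answer granularity: Entity, Table (Cell), Paragraph/Sentence/Phrase
Human-in-the-Loop (HI)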
Honolulu
Michelle Obama
Where was the president of the United States born?
[Figure: action tree for the question, built from the grammar actions below and decoded as a sequence of actions.]
A1: S → set
A4: set → find(set, r1)
A4: set → find(set, r2)
A15: set → {e}
A16: e → United States
A17: r1 → placeOfBirth
A17: r2 → isPresidentOf
Guo et al. Dialog-to-Action: Conversational Question Answering Over a Large-Scale Knowledge Base. NIPS, 2018.
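To illustrate how such an action tree yields an answer, here is a minimal sketch that evaluates find(find({United States}, isPresidentOf), placeOfBirth) over a two-triple toy knowledge base. It is not the implementation from the cited paper. The entity `person_1` is a placeholder, and following edges in either direction inside `find` is an assumption made to keep the example short.

```python
# Toy KB of (subject, predicate, object) triples; "person_1" is a placeholder.
TRIPLES = [
    ("person_1", "isPresidentOf", "United States"),
    ("person_1", "placeOfBirth", "New York City"),
]


def find(entities, predicate, triples=TRIPLES):
    """Entities linked to any member of `entities` by `predicate`.

    Assumption: an edge may be followed in either direction.
    """
    result = set()
    for subject, pred, obj in triples:
        if pred != predicate:
            continue
        if subject in entities:
            result.add(obj)
        if obj in entities:
            result.add(subject)
    return result


def execute(tree):
    """Evaluate a nested action tree: ("set", e) or ("find", subtree, r)."""
    if tree[0] == "set":    # A15: set -> {e}
        return {tree[1]}
    if tree[0] == "find":   # A4: set -> find(set, r)
        return find(execute(tree[1]), tree[2])
    raise ValueError(f"unknown action: {tree[0]}")


# S -> find(find({United States}, isPresidentOf), placeOfBirth)
question = ("find", ("find", ("set", "United States"), "isPresidentOf"), "placeOfBirth")
answer = execute(question)
```

Because the tree is evaluated bottom-up, the inner `find` first resolves the president entity and the outer `find` then looks up the birthplace.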
Dialog Memory
Entity: {United States, tag=utterance}, {New York City, tag=answer}
Predicate: {isPresidentOf}, {placeOfBirth}
Action subsequence:
  set → A4 A15 e_US r_pres
  set → A4 A15
  set → A4 A4 A15 e_US r_pres r_bth
  set → A4 A4 A15
Previous question: Where was the president of the United States born?
Previous answer: New York City
Current question: Where did he graduate from?
[Figure: for the current question the decoder emits A1, A4, then A19, which copies the stored subsequence A4 A15 e_US r_pres from the dialog memory (a replicated action sequence with instantiation), followed by r_grad and the end symbol. The resulting tree is S → set → find(set, r1) with r1 = graduateFrom, applied to find({United States}, r2) with r2 = isPresidentOf.]
Guo et al. Dialog-to-Action: Conversational Question Answering Over a Large-Scale Knowledge Base. NIPS, 2018.
Single-document summarization / sentence summarization
Input: The Sri Lankan government on Wednesday announced the closure of government schools with immediate effect as a military campaign against Tamil separatists escalated in the north of the country.
Summary: Sri Lanka closes schools as war escalates
Multi-document summarization / news aggregation
1. Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou and Tiejun Zhao. Neural Document Summarization by Jointly Learning to Score and Select Sentences. Proc. ACL 2018.
2. Qingyu Zhou, Nan Yang, Furu Wei and Ming Zhou. Selective Encoding for Abstractive Sentence Summarization. Proc. ACL 2017.
3. Pengjie Ren, Zhumin Chen, Zhaochun Ren, Furu Wei, Jun Ma and Maarten de Rijke. Leveraging Contextual Sentence Relations for Extractive Summarization Using a Neural Attention Model. Proc. SIGIR 2017.
North Korea denies that US sanctions drove its denuclearization pledge …
North Korea has denied claims that US-led sanctions are what encouraged it to seek international peace talks, and experts say it was arrogance to ever think that was the case. According to the state-run news agency …
Story from: Business Insider | Reuters | Voice of America
Automatic summarization
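Event alignment
• Time
• Action
• Player
Document planning and generation
• Template-based methods
• Neural network-based methods
• System combination
Image filtering and selection
• Image filtering
• Alignment along the timeline
• Image placement
……
Structured data, unstructured data, image set, football match report
Automatic news report generation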
Neural machine translation
Chatbots
Reading comprehension
Creative writing
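Neural machine translation (NMT)
𝒆=(Economic, growth, has, slowed, down, in, recent, years,.)
𝒇=(近, 几年, 经济, 发展, 变, 慢, 了, .)
[Figure: an encoder recurrent neural network turns one sentence into a vector such as (-0.2, 0.9, -0.1, 0.5, 0.7, 0.0, 0.2); a decoder recurrent neural network generates the other sentence from that vector.]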
Neural machine translation
[Figure: encoder-decoder model, Sutskever et al., NIPS, 2014. Encoder labels: source word 𝒘𝒊, source state 𝒔𝒊. Decoder labels: recurrent state 𝒛𝒊, word sample 𝒖𝒊. The processing steps are numbered (1), (2), (3).]
𝒆=(Economic, growth, has, slowed, down, in, recent, years,.)
𝒇=(近, 几年, 经济, 发展, 变, 慢, 了, .)
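Attention model
[Figure: attention-based NMT, Bahdanau et al., ICLR, 2015. The encoder reads the sentence left-to-right and right-to-left to produce source vectors 𝒉𝒋. At each decoder step, attention weights are applied to the source vectors (⨀) and the results are summed (⊕) into the internal semantic vector 𝒄𝒊, which feeds the recurrent state 𝒛𝒊 and the word sample 𝒖𝒊.]
𝒆=(Economic, growth, has, slowed, down, in, recent, years,.)
Output so far: 近, 几年,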
Attention model
[Figure: attention-based NMT, Bahdanau et al., ICLR, 2015, at a later decoding step. Labels: word sample 𝒖𝒊, recurrent state 𝒛𝒊, internal semantic vector 𝒄𝒊, source vectors 𝒉𝒋, attention weight (⨀, ⊕), encoder, decoder.]
𝒆=(Economic, growth, has, slowed, down, in, recent, years,.)
𝒇=(近, 几年, 经济, 发展, 变, 慢, 了, .)
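The attention step in these figures can be sketched numerically: score each source vector against the current decoder state, normalize the scores into attention weights, and take the weighted sum as the context vector. The sketch below is an illustration under simplifying assumptions. It scores with a plain dot product, whereas Bahdanau et al. use a small feed-forward network, and the vectors are made-up toy values.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(decoder_state, source_vectors):
    """Return (attention weights, context vector) for one decoder step."""
    # Assumption: dot-product scoring stands in for the learned scorer.
    scores = [dot(decoder_state, h) for h in source_vectors]
    weights = softmax(scores)
    dim = len(source_vectors[0])
    # Context vector: attention-weighted sum of the source vectors.
    context = [
        sum(w * h[k] for w, h in zip(weights, source_vectors))
        for k in range(dim)
    ]
    return weights, context


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy source vectors h_j
z = [2.0, 0.0]                            # toy decoder state z_i
weights, context = attend(z, h)
```

In this toy case the first and third source vectors get equal weight, and both outweigh the second, because they align with the decoder state.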
Microsoft Confidential
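• Encoder layer
  • Self-attention layer
  • Feed-forward nonlinear layer
• Decoder layer
  • Self-attention layer
  • Attention-to-source layer
  • Feed-forward nonlinear layer
• Multi-head attention
  • Models different information at different positions
Encoder (N layers): Input Embedding + Positional Encoding → Multi-Head Self-Attention + Residual → Feed Forward + Residual
Decoder (N layers): Output Embedding + Positional Encoding → Multi-Head Self-Attention + Residual → Encoder-Decoder Attention + Residual → Feed Forward + Residual → Softmax
Transformer-based translation model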
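As a rough illustration of the building blocks named on this slide, the sketch below implements sinusoidal positional encoding, single-head scaled dot-product self-attention, and a residual connection in plain Python. It is a simplification under stated assumptions: queries, keys, and values are the inputs themselves (no learned projections), there is a single head, and layer normalization and the feed-forward sublayer are left out.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    output = []
    for query in x:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        output.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return output


def add_residual(x, sublayer_output):
    """Residual connection: add the sublayer output back onto its input."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, sublayer_output)]


x = [[1.0, 0.0], [0.0, 1.0]]  # two toy token vectors
attended = self_attention(x)
y = add_residual(x, attended)
```

Every position attends to every other position in one step, which is why the model needs the positional encoding to recover word order.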
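[Figure: a stack of two encoder layers. Each layer is a self-attention layer followed by a nonlinear layer, each wrapped in a residual connection; the top layer outputs the final hidden states.]
Transformer encoder (two-layer example)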
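[Figure: one decoder layer. A self-attention layer, an attention-to-source layer that attends over the source-language hidden states, and a nonlinear layer, each wrapped in a residual connection.]
Transformer decoder (single-layer example)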
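Back-translation training (Sennrich et al., ACL 2016)
Bilingual data: (x^(n), y^(n)); target-language monolingual data: y^(t)
[Figure: NMT0(y → x) translates the monolingual y^(t) into x'^(t); the synthetic pairs (y^(t), x'^(t)) are used together with the bilingual data to train NMT0(x → y).]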
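The data flow of back-translation can be sketched in a few lines. In the sketch below, `reverse_translate` stands in for a trained target-to-source model NMT(y → x). The word-lookup function and its two-entry lexicon are hypothetical toys used only so the example runs; they are not a real translation system.

```python
def back_translate(bilingual, target_mono, reverse_translate):
    """Augment (x, y) pairs with synthetic (x', y) pairs.

    bilingual: list of (source, target) sentence pairs.
    target_mono: list of target-language sentences without a source side.
    reverse_translate: callable y -> x', standing in for NMT(y -> x).
    """
    synthetic = [(reverse_translate(y), y) for y in target_mono]
    return list(bilingual) + synthetic


# Hypothetical stand-in for a trained target-to-source model.
TOY_LEXICON = {"economic": "经济", "growth": "发展"}


def toy_reverse_translate(sentence):
    return " ".join(TOY_LEXICON.get(word, word) for word in sentence.split())


bilingual = [("经济", "economic")]
target_mono = ["economic growth"]
training_data = back_translate(bilingual, target_mono, toy_reverse_translate)
```

The target side of every synthetic pair is genuine text, so the forward model NMT(x → y) still learns to produce fluent output even when the synthetic source side is noisy.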
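Dual learning (He et al., NIPS, 2016)
Bilingual data: (x^(n), y^(n)); source-language monolingual data: x^(s); target-language monolingual data: y^(t)
[Figure: NMT0(x → y) and NMT0(y → x) are first trained on the bilingual data. A monolingual x^(s) is translated into y'^(s) by NMT0(x → y) and back into x'^(s) by NMT0(y → x). A monolingual y^(t) makes the mirror-image round trip through NMT0(y → x) and NMT0(x → y). Each round trip yields a loss.]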
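Semi-supervised joint training (Zhang et al., AAAI 2018)
Bilingual data: (x^(n), y^(n)); source-language monolingual data: x^(s); target-language monolingual data: y^(t)
[Figure: Iteration 0: NMT0(x → y) and NMT0(y → x) translate the monolingual data, giving the pairs (x^(s), y'^(t)) and (y^(t), x'^(t)). Iteration 1: NMT1(x → y) and NMT1(y → x) translate again, giving (x^(s), y''^(t)) and (y^(t), x''^(t)). Iteration 2: NMT2(x → y) and NMT2(y → x).]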
Deliberation networks (Xia et al., NIPS, 2017)
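Bilingual data: (x^(n), y^(n)); the source side x^(n) of the bilingual data is also used for translation
[Figure: Iteration 0: a left-to-right model NMT0 and a right-to-left model NMT0, both translating x into y, are trained on the bilingual data. Each translates x^(n), giving the pairs (x^(n), y'^(n)). Iteration 1: the left-to-right and right-to-left NMT1 models are fine-tuned and translate again, giving (x^(n), y''^(n)). Iteration 2: the NMT2 models are fine-tuned.]
Bidirectional agreement decoding (Zhang et al., 2018)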
First to reach human parity on the WMT-2017 test set
Achieving Human Parity on Automatic Chinese to English News Translation, Hany Hassan et al, https://arxiv.org/pdf/1803.05567.pdf
Microsoft Human-Parity MT systems
System                        BLEU (%)
Transformer Baseline          24.2
Back Translation              25.57
Dual Learning                 26.51
Agreement Regularization      26.91
Deliberation Nets             27.40
Joint Training                27.71
System Combination            28.46
Sogou (Ensemble)              26.38
Key techniques
News sentence translation examples
Source input 他 的 职业 生涯 如 过 山 车 一般 。
NMT output It has been a rollercoaster ride .
Human reference His career is like a roller coaster.
Source input 有线索人士 请 拨打 旧金山 警察局 举报 电话 4 15- 575 - 44 44 。
NMT output For clues, call the San Francisco Police Department at 415-575 - 4444.
Human reference Anyone with information is asked to call the SFPD Tip Line at 415-575-4444 .
• Sampled from WMT2017 Chinese-English task
Source input 霍夫 施泰特尔 表示 : " 这将由检察官来确定 " 。
NMT output That 's what the prosecutor must determine , " said Hofstetter .
Human reference Mr Hoff Steitel said: "It will be up to the prosecutors to determine.
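Sign language translation (in collaboration with the Chinese Academy of Sciences)
Sentence: 父母生了我们三个孩子 ("Our parents had the three of us children")
Sign gloss: 父母 下 子女 三 (parents, down, children, three)
Segmented sentence: 父母 生了 我们 三个 孩子
Candidate words: 长辈 (elders), 爸妈 (mom and dad); 晚辈 (younger generation), 孩子 (children); 生 (give birth), 产 (produce); 仨 (three people), 三种 (three kinds)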
Neural machine translation
Chatbots
Reading comprehension
Creative writing
User context, XiaoIce context, current user input
Read
Remember
Distill
Response generation
User profile
Conversation sentiment
Decoder
Attention Model
LSTM
Encoder
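Microsoft XiaoIce: launched in five countries (China, Japan, the US, India, and Indonesia)
2014    China        XiaoIce (小冰)
2015    Japan        Rinna (りんな)
2016    US           Zo
2017    India        Ruuh
2017    Indonesia    Rinna
Dunhuang official-account customer service system (Dunhuang XiaoIce)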
Zhao Yan, et al, DocChat: An Information Retrieval Approach for Chatbot Engines Using Unstructured Documents, ACL 2016
Neural machine translation
Chatbots
Reading comprehension
Creative writing
Machine reading comprehension
Read a document (passage) and then answer questions about it
Passage (P) + Question (Q) → Answer (A)
P: Tesla later approached Morgan to ask for more funds to build a more powerful transmitter. When asked where all the money had gone, Tesla responded by saying that he was affected by the Panic of 1901, which he (Morgan) had caused. Morgan was shocked by the reminder of his part in the stock market crash and by Tesla’s breach of contract by asking for more funds.
Q: On what did Tesla blame for the loss of the initial money?
A: Panic of 1901
Dataset                                  # of <question, passage> pairs
Training                                 87,599
Dev                                      10,570
Test (not available to participants)     > 10K
ImageNet-style competition for machine reading comprehension
Best Resource Paper in EMNLP 2016
Date          System                   EM
2016.12.6     MSRA                     74.520
2017.1.20     MSRA                     75.860
2017.3.7      MSRA                     76.920
2017.7.2      MSRA                     77.688
2017.7.25     iFLYTEK                  77.845
2017.8.16     Salesforce               78.706
2017.9.20     Microsoft Business AI    78.842
2017.10.13    MSRA                     78.926
2017.10.17    iFLYTEK/HIT              79.083
2017.11.17    AI2                      81.003
2017.11.21    MSRA                     81.685
2017.12.18    MSRA                     82.136
2018.1.3      MSRA                     82.650
2018.1.5      Alibaba iDST             82.440
2018.1.22     iFLYTEK/HIT              82.5
Human EM Performance: 82.304
Best System EM Scores on SQuAD Machine Reading Comprehension Dataset (Dec. 6, 2016-Jan. 26, 2018)
Surpass Human EM [2018.1.3]
Progress in reading comprehension (exact-match answers)
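The EM (exact match) scores on this slide count a prediction as correct only if it equals a reference answer after light normalization. The sketch below shows one way to compute such a score. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) follow what is commonly done for SQuAD-style evaluation, but this is an illustrative re-implementation rather than the official scoring script.

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """True if the prediction equals any reference after normalization."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(ref) for ref in references)


def em_score(predictions, references):
    """Percentage of predictions that exactly match one of their references."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)


score = em_score(
    ["the Panic of 1901.", "Morgan"],
    [["Panic of 1901"], ["Panic of 1901"]],
)
```

EM is a strict metric: an answer that is almost right, such as one with an extra word, scores zero, which is why systems are usually also compared on a softer token-overlap measure.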
Neural machine translation
Chatbots
Reading comprehension
Creative writing
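Our journey in computer-generated creative writing
Couplets (2005), character riddles (2012), classical poetry (2014), lyrics (2016/2017), composition/music (2016/2017), modern poetry (2016/2017)
Couplet problem definition
FS (first sentence):  海(hai) 阔(kuo) 凭(ping) 鱼(yu) 跃(yue)
                      sea wide allow fish jump
                      | | | | |
SS (second sentence): 天(tian) 高(gao) 任(ren) 鸟(niao) 飞(fei)
                      sky high permit bird fly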
Repetition of pronunciations (音韵联)
风 (wind) -- 水 (water)
吹 (blow) -- 使 (make)
荞 (buckwheat) -- 舟 (ship)
动 (wave) -- 流 (go)
桥 (bridge) -- 洲 (island)
未 (not) -- 不 (not)
动 (wave) -- 流 (go)
Decomposition of characters (拆字联): 好 = 女 + 子, 鲜 = 鱼 + 羊
有 (have) -- 缺 (lack)
子 (son) -- 鱼 (fish)
有 (have) -- 缺 (lack)
女 (daughter) -- 羊 (mutton)
方 (so) -- 敢 (dare)
称 (call) -- 叫 (call)
好 (good) -- 鲜 (fresh)
Person name (人名联) and palindrome (回文联)
板桥 (Banqiao) -- 东坡 (Dongpo)
造 (produce) -- 居 (live)
桥 (bridge) -- 坡 (mountain)
板 (board) -- 东 (east)
• Banqiao (板桥) and Dongpo (东坡) are famous writers
• Reading from top to bottom is identical to reading from bottom to top
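Pipeline: FS input → phrase-based log-linear model (SMT decoding) → N-best candidates → linguistic filters → ranking SVM model (reranking) → SS output
FS input: 海 阔 凭 鱼 跃 (sea wide allow fish jump)
Candidate words: 天 (sky), 山 (hill); 高 (high), 深 (deep); 任 (permit), 倚 (depend); 鸟 (bird), 虫 (insect), 虎 (tiger); 飞 (fly), 舞 (dance), 鸣 (tweedle)
Candidate phrases: 天高 (sky high), 山高 (hill high), 鸟飞 (bird fly), 虎啸 (tiger roar)
After SMT decoding: 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山高靠虎啸, 山高任虎啸, 山深任鸟飞, 天高任花香, ……
After linguistic filtering: 山高任鸟飞, 天高任鸟鸣, 天高任鸟飞, 山深任鸟飞, 天高任花香, 天高任鸟舞, 山高任花香, ……
After reranking: 天高任鸟飞, 山高任鸟飞, 天高任鸟鸣, 天高任鸟舞, 山深任鸟飞, 山高任花香, 天高任花香, ……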
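The linguistic filtering stage can be pictured as a set of hard constraints applied to each candidate second sentence. The sketch below checks three constraints that are assumptions made for illustration, not the system's documented filter set: equal length, a matching character-repetition pattern (as in the repetition couplets shown earlier), and no character shared with the first sentence.

```python
def repetition_pattern(line):
    """Map each character to the index of its first occurrence.

    For example, a line shaped like ABA yields (0, 1, 0).
    """
    first_seen = {}
    return tuple(first_seen.setdefault(ch, i) for i, ch in enumerate(line))


def passes_filters(first_sentence, second_sentence):
    """Apply the illustrative hard constraints to one candidate."""
    if len(first_sentence) != len(second_sentence):
        return False
    # Repeated characters must repeat at the same positions in both lines.
    if repetition_pattern(first_sentence) != repetition_pattern(second_sentence):
        return False
    # Assumption: the two lines should not reuse each other's characters.
    if set(first_sentence) & set(second_sentence):
        return False
    return True


def filter_candidates(first_sentence, candidates):
    return [c for c in candidates if passes_filters(first_sentence, c)]


kept = filter_candidates(
    "海阔凭鱼跃",
    ["天高任鸟飞", "天高任鸟", "海高任鸟飞", "天天任鸟飞"],
)
```

Filters of this kind only discard candidates; choosing the best survivor is left to the reranking model.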
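Features
Mutual information (MI) score of a candidate second sentence $S = s_1 \dots s_n$:
$$\mathrm{MI}(S) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} I(s_i, s_j) = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \log \frac{p(s_i, s_j)}{p(s_i)\, p(s_j)}$$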
http://duilian.msra.cn
http://video.sina.com.cn/v/b/10937201-1452530713.html
感归 春兴
从军北征 望洞庭
回忆 Memories
把你写过的日记 The diary you wrote
埋藏在我心底 Buried in my heart
写下我所有的记忆 Write down all my memories
把回忆留给自己 Leave the memories to myself
把你写在我心里 Write you up in my heart
写满了岁月的痕迹 Filled with the traces of years
不是因为我知道 Not because I know
让我想念你的微笑 Let me miss your smile
让我听见你的心跳 Let me hear your heartbeat
自从遇见你那一秒 Since the second I met you
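《机智过人》 is a large-scale science challenge show jointly produced by CCTV-1, the general channel of China Central Television, and the Chinese Academy of Sciences.
It is China's first science challenge show focused on intelligent technology, an in-depth collaboration between China's scientific and media communities, and a showcase for leading AI researchers and technology projects from around the world. It marks a new height for the national strategy of "revitalizing the country through science and education".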
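Future research directions
• Personalized services through user profiling
• Insight into how AI works through explainable learning
• More efficient learning by combining knowledge with deep learning
• Domain adaptation through transfer learning
• Self-evolution through reinforcement learning
• Full use of unlabeled data through unsupervised learning
NLP technology will mature over the next 5 to 10 years
• Spoken-language machine translation becomes ubiquitous
• Natural language conversation (chat, question answering, dialogue) becomes practical
• Intelligent customer service combined with human agents greatly improves efficiency
• Automatically written poems, news, novels, and pop songs become popular
• NLP drives the spread of voice assistants, the Internet of Things, smart hardware, and smart homes
• Together with other AI technologies, NLP is widely applied in vertical domains such as finance, law, education, and healthcare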