A May Fourth tribute: only first place, no second! Microsoft's new AI technology achieves "conversational understanding on par with humans", with its latest NLP model surpassing human performance on all three scores

Reader contribution 2019-05-02 20:06

https://cdn.china-scratch.com/timg/190504/20063634J-0.jpg

Machine Reading Systems Are Becoming More Conversational

Microsoft's latest NLP model surpasses human performance on all three scores

Preface: In the 2019 Conversational Question Answering (CoQA) challenge organized by Stanford University, the NLP team at Microsoft Research Asia and Microsoft's Redmond speech dialog team have set another historic milestone!

According to the CoQA leaderboard, the ensemble system that Microsoft researchers submitted on March 29, 2019 reached 89.9/88.0/89.4 as its respective in-domain, out-of-domain, and overall F1 scores. Human performance on the same set of conversational questions and answers stands at 89.4/87.4/88.8.

Microsoft is currently the only team whose model performance has reached human parity.


This achievement marks a major advance in the effort to have search engines such as Bing and intelligent assistants such as Cortana interact with people and provide information in more natural ways, much like how people communicate with each other.  

Microsoft blog, May 30, 2019

https://cdn.china-scratch.com/timg/190504/2006363334-1.jpg

Xuedong Huang:

Warm congratulations to Microsoft Research Asia and our US speech dialog research team on their joint effort reaching a historic, human-parity milestone in Stanford University's conversational comprehension challenge!

https://cdn.china-scratch.com/timg/190504/20063B202-2.jpg


A team of researchers from the Natural Language Processing (NLP) Group at Microsoft Research Asia (MSRA) and the Speech Dialog Research Group at Microsoft Redmond are currently leading in the Conversational Question Answering (CoQA) Challenge organized by Stanford University. In this challenge, machines are measured by their ability to understand a text passage and answer a series of interconnected questions that appear in a conversation. Microsoft is currently the only team to have reached human parity in its model performance.


The questions in CoQA are very short, to mimic human conversation. In addition, every question after the first is dependent on the conversational history, which makes the short questions even more difficult for machines to parse. For example, suppose you ask a system, “Who is the founder of Microsoft?” You need it to understand that you are still speaking about the same subject when you ask the follow-up question, “When was he born?”
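To make the dependence on conversational history concrete, here is a minimal sketch in Python of what such a record and a history-augmented model input might look like. The field names and input format are illustrative assumptions, not the official CoQA schema or Microsoft's preprocessing.

```python
# A hypothetical, simplified CoQA-style record. Field names are
# illustrative and do not follow the official dataset schema.
example = {
    "story": "Microsoft was founded by Bill Gates and Paul Allen in 1975. "
             "Gates was born in Seattle in 1955.",
    "questions": [
        {"turn": 1, "text": "Who is the founder of Microsoft?"},
        # Turn 2 only makes sense given turn 1: "he" refers back to Gates.
        {"turn": 2, "text": "When was he born?"},
    ],
    "answers": [
        {"turn": 1, "text": "Bill Gates and Paul Allen"},
        {"turn": 2, "text": "1955"},
    ],
}

# One common way to let a model resolve "he": concatenate the passage
# with the earlier question-answer turns before the current question.
history = " ".join(
    f"Q: {q['text']} A: {a['text']}"
    for q, a in zip(example["questions"][:-1], example["answers"][:-1])
)
model_input = f"{example['story']} {history} Q: {example['questions'][-1]['text']}"
print(model_input)
```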


https://cdn.china-scratch.com/timg/190504/20063M347-3.jpg

A conversation from the CoQA dataset, showing the logical links between each new question and the questions before it

To better test the generalization ability of existing models, CoQA collected data from seven different domains: children’s stories, literature, middle and high school English exams, news, Wikipedia, Reddit, and science. The first five are used in the training, development, and test sets, and the last two are used only for the test set. CoQA uses the F1 metric to evaluate performance. The F1 metric measures the average word overlap between the prediction and ground truth answers. In-domain F1 is scored on test data from the same domain as the training set; and out-of-domain F1 is scored on test data from different domains. Overall F1 is the final score on the whole test set.
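As a rough illustration of the metric, the function below computes word-overlap F1 for one prediction against one reference. It is a simplified sketch: the official CoQA evaluation additionally normalizes text (lowercasing, stripping punctuation and articles) and takes the maximum score over multiple human references.

```python
from collections import Counter

def word_overlap_f1(prediction: str, ground_truth: str) -> float:
    """Simplified word-overlap F1 between one prediction and one reference.

    The official CoQA script additionally normalizes the text and takes
    the maximum over several human reference answers; those steps are
    omitted here for brevity.
    """
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(word_overlap_f1("Bill Gates and Paul Allen", "Bill Gates"))  # ~0.571
```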


https://cdn.china-scratch.com/timg/190504/20063I628-4.jpg

Schematic of the multistage, multitask fine-tuning model

The method used by the Microsoft researchers employs a special strategy, in which information learned from several related tasks is used to improve the target machine reading comprehension (MRC) task. In this multistage, multitask, fine-tuning method, researchers first learn MRC-relevant background information from related tasks under a multitask setting, and then fine-tune the model on the target task. Language modeling is additionally used as an auxiliary task in both stages to help reduce the over-fitting of the conversational question-answering model. Experiments have supported the effectiveness of this method, which is further demonstrated by its strong performance in the CoQA Challenge.
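The shape of this recipe can be sketched in PyTorch. Everything below is a toy stand-in under stated assumptions: a tiny encoder in place of a large pre-trained model, random tensors in place of the related-task and CoQA data, and a hypothetical loss-mixing weight. It shows only how an auxiliary language-modeling loss is combined with the MRC loss in both stages, not the team's actual implementation.

```python
import torch
from torch import nn

class MultiTaskReader(nn.Module):
    """Toy stand-in for a large pre-trained model: a shared encoder
    with an MRC head and an auxiliary language-modeling (LM) head."""
    def __init__(self, vocab_size=1000, hidden=64):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, hidden)  # shared representation
        self.mrc_head = nn.Linear(hidden, 2)             # per-token answer-span tags
        self.lm_head = nn.Linear(hidden, vocab_size)     # next-token logits

    def forward(self, tokens):
        h = self.encoder(tokens)
        return self.mrc_head(h), self.lm_head(h)

def train_stage(model, batches, lm_weight=0.5, lr=1e-3):
    """One stage of training: MRC loss plus a weighted auxiliary LM loss,
    the latter acting as a regularizer against over-fitting."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for tokens, span_labels, next_tokens in batches:
        mrc_logits, lm_logits = model(tokens)
        mrc_loss = ce(mrc_logits.reshape(-1, 2), span_labels.reshape(-1))
        lm_loss = ce(lm_logits.reshape(-1, lm_logits.size(-1)),
                     next_tokens.reshape(-1))
        loss = mrc_loss + lm_weight * lm_loss  # hypothetical mixing weight
        opt.zero_grad()
        loss.backward()
        opt.step()

def toy_batches(n=3, batch=4, seq=16, vocab=1000):
    """Random tensors standing in for real (tokens, span labels, LM targets)."""
    for _ in range(n):
        yield (torch.randint(vocab, (batch, seq)),
               torch.randint(2, (batch, seq)),
               torch.randint(vocab, (batch, seq)))

model = MultiTaskReader()
train_stage(model, toy_batches())  # stage 1: multitask learning on related MRC tasks
train_stage(model, toy_batches())  # stage 2: fine-tuning on the target task (CoQA)
```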


According to the CoQA leaderboard, the ensemble system that Microsoft researchers submitted on March 29, 2019 reached 89.9/88.0/89.4 as its respective in-domain, out-of-domain, and overall F1 scores. Human performance on the same set of conversational questions and answers stands at 89.4/87.4/88.8.


CoQA is a large-scale conversational question-answering dataset that is made up of conversational questions on a set of articles from different domains. The MSRA NLP team previously reached the human parity milestone on single-round question answering using the Stanford Question Answering Dataset (SQuAD). Compared with SQuAD, the questions in CoQA are more conversational and the answers can be free-form text to ensure the naturalness of answers in a conversation.

Nonetheless, general machine reading comprehension and question answering remains an unsolved problem in natural language processing. To further push the boundary of machine capability in understanding and generating natural language, the team continues to work on producing even more powerful pre-training models.


About the Researcher

https://cdn.china-scratch.com/timg/190504/20063Hc4-5.jpg

https://cdn.china-scratch.com/timg/190504/20063Q936-6.jpg

Xuedong Huang

黄学东

Technical Fellow in Microsoft's Cloud and AI group, Vice President of Speech and Language at Microsoft, and Microsoft's Chief Speech Scientist; Fellow of the IEEE and the ACM. For many years he has led Microsoft's global teams in the United States, China, Germany, Israel, and elsewhere in developing the company's newest AI products and technologies, including its enterprise AI Cognitive Services. As Microsoft's Chief Speech Scientist, he led the speech and dialog research team that achieved the historic human-parity milestone in speech recognition in 2016.

Before joining Microsoft in 1993, he worked in the School of Computer Science at Carnegie Mellon University. His honors include the Allen Newell Research Excellence Leadership Award (1992), the IEEE Best Paper Award (1993), and Asian American Engineer of the Year (2011). In 2016, Wired magazine named him one of 25 geniuses who are creating the future of business.

He received his PhD from the University of Edinburgh, his master's degree from Tsinghua University, and his bachelor's degree from Hunan University.

https://cdn.china-scratch.com/timg/190504/20063U913-7.jpg

--end--

Statement: This article was contributed by a reader and is shared for educational purposes. In case of infringement, the original author can contact us by email for prompt removal: freemanzk@qq.com