
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2018

Conversational agent as kitchen assistant

BEATA RYSTEDT

MIA ZDYBEK

KTH SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Swedish title: Chatbot med konverationsgränssnitt som hjälpreda i köket (Chatbot with a conversational interface as a kitchen helper)

Abstract

Chatbots, also called conversational agents, with speech interfaces are being used to an ever greater extent, but there are still many areas that are not completely explored. The idea of this project was born out of the belief that there is a need for an assistant in the kitchen that is able to search for recipes, answer questions regarding them, and guide and assist the user throughout the cooking process, all through conversation since the hands are busy. This paper begins with an introduction to the subject of conversational agents and the related technology; then similar, already existing studies and methods are presented with their pros and cons. After that follows an in-depth explanation of how the program was constructed into a working kitchen assistant. Lastly, the users' experiences of the performance and usability of the program were evaluated through tests and discussed. It turns out that conversational agents definitely can be integrated in the kitchen, and according to several sources, in a few years they will be implemented in all possible areas and change the technology of our time.


Sammanfattning (Summary)

Conversational robots with speech interfaces are being used to an ever greater extent, but there are still many areas that are not fully explored. The idea for this work was born from the view that there is a need for a kitchen helper that can search for recipes, answer questions about the recipe, and guide and help the user through the whole cooking process in spoken form, since the hands are busy with other things. This work begins with an introduction to the subject of conversational robots and the technology behind them; then similar works and methods that already exist in the area are presented. Next follows a deep dive into how the program produced in this work was developed into a functioning cooking assistant. Finally, people's experience of the conversational robot and its usability are presented and discussed, based on tests that were carried out. It turns out that conversational robots can very well be of use in the kitchen, and according to several sources they will in the near future be implemented at a rapidly growing pace in all possible areas and change the technology in our society.


Contents

1 Introduction
    1.1 Background
        1.1.1 Conversational agents
        1.1.2 Natural Language Processing
        1.1.3 Speech interface
        1.1.4 Recipes and cooking
    1.2 Project
        1.2.1 Problem situation
        1.2.2 Goal

2 Related works & tools
    2.1 Related works
        2.1.1 Similar programs
        2.1.2 Related works
    2.2 Tools

3 System
    3.1 Overview
    3.2 Dialogflow platform
        3.2.1 Created intents
        3.2.2 Created entities
    3.3 Recipe class
    3.4 Speech interface
    3.5 Natural language processing
    3.6 Web search & web scraping
    3.7 Operate in recipe

4 Evaluation
    4.1 Method
    4.2 Results

5 Discussion
    5.1 Evaluation
    5.2 System limitations
        5.2.1 Recipe scraping and selecting recipes
        5.2.2 Dialogflow
    5.3 Improvements & comparisons
    5.4 Ethics and sustainability

6 Conclusion


1 Introduction

In this chapter, background on the relevant technologies and on recipes and cooking is presented. Then the problem situation from which the idea of the project emerged is discussed, and finally the goal and purpose of the project are presented.

1.1 Background

1.1.1 Conversational agents

Conversational agents, also called chatbots, combine conversational interface technology and Natural Language Processing, and sometimes also web-based services, to deliver interactive speech- or text-based dialogs [12]. The accessibility and usefulness of conversational agents are constantly increasing, and the agents are applied in more and more areas of people's everyday lives. Conversational agents from major companies, like Apple with Siri, Amazon with Alexa and Google with Google Assistant, have many functions such as home automation, music control, weather checks, ordering food, reading the news, searching the web and more. Developers are constantly evolving these complex agents, and more and more "skills" are added. In addition to these multi-use agents, there are assistants that specialize in single tasks. One example of a platform that can be used to develop such specialized agents is Google's Dialogflow (previously api.ai), which lets users compose their own agents and then integrate them into their own app, website or platform.

Chatbots can have no specific skills besides conversing, like a conversational chatbot, or be specialized in one field, for example controlling an audio player or kitchen appliances. They can also combine skills to make all-round conversational assistants. A conversational bot's purpose is to entertain the user and simulate a conversation without a specific goal. A task-oriented bot's purpose is to interpret the request and perform the task the user is asking for. Most task-oriented chatbots also have some conversational skills. Chatbots of varying degrees of intelligence have existed for years. The first chatbot appeared in the 1960s and used simple keyword matching to generate an answer [20]. Chatbots today have evolved and use technologies such as pattern matching, keyword extraction, NLP, machine learning, and deep learning.

1.1.2 Natural Language Processing

A conversational agent uses Natural Language Processing, NLP, to perform interactive dialogs with a user. NLP uses computer and information sciences, linguistics, mathematics, electrical and electronic engineering, artificial intelligence and robotics, psychology, and other areas to explore how computers can be used to understand and manipulate natural language text or speech, by gathering information about how human beings understand and use language [6]. The aim of NLP research is to develop appropriate tools and techniques to make computer systems able to understand and manipulate natural languages to perform the desired tasks.

1.1.3 Speech interface

A speech interface allows the user to interact with technology through speech and hearing instead of with a mouse, keyboard or similar physical objects. Speech interfaces consist of two main parts: speech recognition, where an acoustic signal is transformed into textual words, i.e. the user's speech is recognized by the computer, and speech synthesis, which transforms text into speech. Even though a speech interface can transform speech to text and text to speech, it cannot in itself understand what the user is saying.

Speaking is a highly natural process for humans, while typing on a keyboard or clicking a mouse is not. Speech interfaces in technology have therefore been a hot topic for over 20, even up to 40, years [11]. People talked about how it would change the way we interact with our technology: our telephones, computers and other connected devices. Through the two parts of speech interfaces, speech recognition and speech synthesis, users would be free of the constraints of pointers, keyboards and even screens, and instead be able to use their voice to control the devices. However, the technology was for a long time not evolved enough to provide a pleasant interactive experience for the user; the synthesized speech was unpleasant to listen to, and the recognition did not properly understand acoustic input. With time, the technology improved, and today speech interfaces are widespread and integrated in a variety of devices and programs, including but not limited to smartphones, computers, smartwatches and speakers.

1.1.4 Recipes and cooking

With the first recipes written on cave walls, followed by stone tablets and parchment rolls, today's collections of recipes are quite different [15]. Cookbooks have been around for about 2000 years, with the first collection of recipes written down in a cookbook by Marcus Gavius Apicius. Even though web-based recipe sources are becoming increasingly popular, cookbooks are still highly relevant: the most popular cookbook in the United States in 2016 sold over 400 000 copies [18]. However, the accessibility and low cost of recipes on websites and in apps are not easily overlooked. Most big grocery stores have both apps and websites where they present recipes, and many recipe websites make sure to also have an app version of their recipe database. When looking for inspiration and related works for this project, a few apps with audio functions were even discovered. Mondelez Sverige AB (Philadelphia) has made an app called "Voice Cooking App" where users can search for recipes with audio commands and use their voice to get guided through the recipe, but the app doesn't read anything out loud itself. An app called SideChef was also found, which reads the instructions of a recipe aloud and listens to audio to know when to proceed to the next step of the recipe. Both apps are later compared to the project program to deepen the understanding of conversational interfaces integrated into the area of cooking.

1.2 Project

The purpose of this project is to evaluate how conversational agents can be used to assist in cooking and recipe searches by creating and evaluating a conversational agent with a speech interface for said purpose.

1.2.1 Problem situation

Modern conversational agents are widespread and include a wide variety of functions, but some areas are still unexplored. The most advanced conversational agents can order food for you and search the web for recipes, but other than that they are not particularly useful when it comes to food and cooking. Imagine you are in the kitchen cooking from a recipe on your phone or tablet and your hands are dirty. You need to know the next instruction for the recipe, how much of an ingredient to use or how long you should knead the dough, but the screen is locked on your device. You have no choice but to wash and dry your hands or just make something up and hope it turns out well. Now imagine you are walking home from school or work and are wondering what to cook for dinner. It is freezing outside, so you'd rather not take out your phone and take off your gloves. You could use your phone's built-in conversational assistant to help you search the web for recipes, but you'd still need to take out your phone to see the results and choose a recipe. These two problem situations led to the decision to develop a conversational assistant that, completely through conversation, makes it possible to search for and select a recipe and to get information about the recipe in all stages of the cooking process.

1.2.2 Goal

The goal of the project is to develop a conversational assistant that can eliminate all need for other interfaces and guide the user through all stages of recipe searching and cooking, and then to evaluate this use of conversational agent technology. The assistant should have functions such as saving the ingredients of a recipe as a grocery list on the device, going back and forth through a recipe, and providing information about the amount of and instructions for an ingredient. The evaluation will then be used to determine how people think conversational agent technology can be useful and helpful in the intended areas and whether the developed agent works as expected, and to explore possible improvements and extensions of the agent.


2 Related works & tools

In this section, programs similar to the project program will be discussed, as well as similar works about conversational agents and the opportunities and possibilities of conversational agents. Then, tools used in the project will be presented.

2.1 Related works

Conversational agents are all around and can be built in a variety of different ways. Here, two cooking apps with conversational and visual interfaces will be presented and their features will be discussed. Then, similar works about conversational agents and different ways of implementing a conversational agent are presented.

2.1.1 Similar programs

When researching for this project, two applications similar to the project were found. They are both recipe and/or cooking apps with conversational interfaces of different complexity, and were used as inspiration for some of the features of the project.

Voice Cooking App is an application that can be downloaded to smartphones, where users can search for recipes with audio commands and use their voice to get guided through the recipe. These features are similar to what is attempted in this project, but the app has a much simpler conversational interface and use of Natural Language Processing, in that it only responds well to one-word commands. For example, when a user wants to search for "Cheesecake", he or she just says "cheesecake" on the search page, and to go to the next step, the user has to be "in" the recipe and then say "next". This makes the app different from this project in that it has to be used together with its visual interface, and the interaction with the app is less natural and conversational. The Voice Cooking App also does not read anything out loud; the user reads the instructions and ingredients on the screen of the smartphone, which of course also requires the visual interface. A visual interface is not a bad thing: it can be useful for increasing understanding of the instructions, or provide an alternative source of information if the user doesn't understand an ingredient or what a certain kitchen gadget is. However, a visual interface can be limiting in some cases, for example when a user wants to search for a recipe without taking the phone out of his or her pocket because of cold weather. An idea that came from examining this app was to have a visual interface, but only as an addition to, and not instead of, the functions of the project program. This way the user can choose whether or not to use the visual interface and has the option to use the program with only the conversational interface. Other features of the Voice Cooking App are a timer integrated into instructions that require keeping track of time, and functions for making notes on the recipe and sharing it with other users.

The other app that was found is called SideChef. This app combines the conversational and visual interfaces as discussed above by reading the instructions of a recipe aloud, while showing a picture related to the instruction, and listening to audio to know when to proceed to the next step of the recipe. Similar to Voice Cooking App, this app seems to only accept short voice commands like "next" to show the next instruction. It does not use voice control for recipe searches or for functions other than listening for when to go to the next step. Having functions only for short commands could be considered "simple" and "boring", but it comes with one clear advantage: easy interpretation. If the user knows they are expected to just say "next", and the program only accepts "next" and nothing else, NLP is not necessary, and the disadvantages and problems that can come with NLP disappear. Also similar to Voice Cooking App, the SideChef app has functions like keeping track of time, rating recipes and sharing photos and recipes with friends and other users. A rating function could be considered useful because users get information about what previous users thought of a recipe. Integrating a timer into the instructions when it is relevant also eliminates the need for the user to switch programs in order to set a timer. Even though most smartphones have built-in conversational agents that can set a timer, it can cause problems if apps are switched automatically.
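The point about short commands can be illustrated with a minimal Python sketch. It is not taken from either app: the vocabulary `COMMANDS` and the function `handle_command` are hypothetical names chosen for illustration, and the sketch assumes that speech recognition has already turned the audio into a text string.

```python
# Hypothetical fixed vocabulary; neither app documents its exact command list.
COMMANDS = {"next": 1, "back": -1}


def handle_command(utterance, step, total_steps):
    """Return the new step index, or the old one if the word is not a known command."""
    word = utterance.strip().lower()
    if word not in COMMANDS:
        return step  # anything outside the vocabulary is simply ignored
    new_step = step + COMMANDS[word]
    # Clamp so that the index never leaves the range of available steps.
    return min(max(new_step, 0), total_steps - 1)
```

With a fixed vocabulary like this, interpretation is a dictionary lookup, which is why no NLP is needed. The cost is that a natural phrase such as "please go on" is not understood at all.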

In conclusion, the lessons learned from Voice Cooking App and SideChef are that it could be a good idea to integrate a visual interface into the program, to integrate a timer into relevant instructions, and to have functions for rating and sharing recipes. This will not be pursued in this project, but could be part of future improvements.

2.1.2 Related works

Conversational agents can be implemented in several ways. This section provides an insight into some similar works and methods, and also an idea of the existing possibilities regarding conversational agents.

Speech recognition with deep learning

One paper that is related to this thesis project is about speech recognition implemented with deep learning [7]. Since the conversational agent created in this project uses speech recognition, it is relevant to have an insight into how its realization can vary. The aim of the mentioned paper was to investigate how speech input can be translated into text with deep learning.

Deep learning is a way of establishing rules for processing data with the help of multiple layers of non-linear information processing, which represent varying levels of abstraction. The definitions are created at the bottom level and then, with the help of hidden layers, an output is generated, as visualized in Figure 1 below. One example is that the input is a vector and the different layers are operators processing the vector into an output of the desired form. Deep learning connections, or neural connections, and their algorithms can be structured in many ways, for example as recursive neural networks, convolutional neural networks and deep belief networks.


Figure 1: Visualization of Deep Learning [2]
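The idea of an input vector being processed layer by layer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights in `LAYERS` are hand-picked rather than trained, the `tanh` non-linearity is one common choice among many, and the function names are hypothetical.

```python
import math


def dense_layer(inputs, weights, biases):
    """Apply one layer: a weighted sum per unit followed by a non-linear tanh."""
    return [
        math.tanh(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the input vector through each (weights, biases) layer in turn."""
    activation = inputs
    for weights, biases in layers:
        activation = dense_layer(activation, weights, biases)
    return activation


# Illustrative, hand-picked weights: 2 inputs -> 3 hidden units -> 1 output.
LAYERS = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]
output = forward([1.0, 2.0], LAYERS)
```

In a real network the weights would be learned from data during training; here they only serve to show how each layer turns the output of the previous one into a new vector.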

The techniques for implementing deep learning are divided into two kinds: supervised and unsupervised. Supervised deep learning is based on training examples of input and output pairs where the category of the input is known. Unsupervised training, on the other hand, is training where the input is not labeled, so there is no direct way of evaluating the accuracy of the algorithm. The advantage is that the classification does not have to be done in advance, which saves time. Usually speech recognition is implemented with a combination of these two methods: unsupervised pre-training (pattern analysis) first initializes the weights of the network, which increases the efficiency of the supervised training. The speech recognition created in the study was very sensitive to noise in the input and did not perform as desired.

Conversational Agents Implemented with Neural Networks

One similar project that was encountered was a chatbot created with the help of a neural network-based model, which is a way of implementing deep learning that is explained above [17]. It means that the system is teaching and learning by itself, trying to resemble biological neural networks such as the ones inside the human brain. In other words, the model's training is based directly and only on the conversational data. As a consequence, neural networks require a lot of conversational training before the actual task-solving phase can start. On the other hand, the method does not entail any purpose-specific presets, meaning that once the model exists it can be used in various areas.

Figure 2: Neural networks [19]


In the study, two types of conversational agents were discussed: retrieving and generating agents. Retrieving agents are controlled by machine learning with databases as a foundation; this is the kind of agent created in this project. It means that the NLP responds to the invocation based on an existing set of information. The clear benefit is that it is easier for the program to determine the context and take the relevant action. The downside is that constructing the database requires a large amount of work, and that there is no way to ensure that the program won't stumble upon something that does not fit in the database and therefore can't be interpreted by it.
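The trade-off of a retrieving agent can be made concrete with a small Python sketch. It is not the method of the cited study or of this project: the entries in `KNOWLEDGE_BASE`, the word-overlap (Jaccard) score and the threshold `min_overlap` are all assumptions chosen to keep the example short.

```python
# Hypothetical miniature "database" of known utterances and their responses.
KNOWLEDGE_BASE = {
    "what is the next step": "Reading the next instruction.",
    "how much flour do i need": "Looking up the amount of flour.",
    "find a recipe for pancakes": "Searching for pancake recipes.",
}


def tokenize(text):
    """Lower-case the text, drop question marks and split it into a set of words."""
    return set(text.lower().replace("?", "").split())


def retrieve_response(utterance, min_overlap=0.5):
    """Return the response of the best-matching known utterance, or None."""
    words = tokenize(utterance)
    best_score, best_response = 0.0, None
    for known, response in KNOWLEDGE_BASE.items():
        known_words = tokenize(known)
        # Jaccard similarity: shared words divided by all distinct words.
        score = len(words & known_words) / len(words | known_words)
        if score > best_score:
            best_score, best_response = score, response
    if best_score < min_overlap:
        return None  # the input does not fit anything in the database
    return best_response
```

An input close to a stored utterance is matched reliably, while an input such as "tell me a joke" returns `None`, which mirrors the downside described above: whatever is not covered by the database cannot be interpreted.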

For generating agents, such as the one evaluated in the study, the answers and suitable actions are governed by earlier conversations and generated completely based on the user's input. The advantage of this is that all kinds of input can be handled by the program, and with much training the program can presumably become efficient. The downside is that there is a big chance that the chatbot will offer wrong, irrelevant or incomprehensible support.

The conversational agent in the study was trained to assist the user in booking a restaurant by sequence-to-sequence training, in this case learning how to map between an input and an appropriate answer. Such systems have several problems. They have trouble handling diversity of utterances. They tend to output phrases that are common in the training data, like "yes", "no" and "thanks", to maximize the probability of success even when the phrase is not related to the conversation. Also, the consistency of the agent's output is unstable due to inputs from different users and variation in language. The produced agent had trouble displaying options and providing extra information, with 0 % per-conversation accuracy. This indicates that neural networks such as the one created in this study require a large amount of data to work properly. Some improvements were presented, in the form of displaying information about how sure the system was about the answer and training the program to detect when it is out of context.

Sentiment based Chatbots

Another way to create a conversational agent is by constructing it based on sentiment; in other words, the agent's reply is determined by the attitude of the input [4]. It could be the feeling of the user's input, the emotional reaction the user wanted to provoke by the input, or the user's evaluation of something.

One paper regarding this was done on short texts found on Twitter, and simplified to association with only negative or positive sentiment. The Natural Language Processing part in the paper was implemented with the help of the Python package NLTK and its Naive Bayes classifier, a statistical method somewhat similar to the method in the section above. The big difference is that the learning algorithm is combined with a big set of data. The advantage of using statistical methods compared to decision trees and similar methods is that the program will be better at guessing the appropriate answer, instead of being clueless if the specific situation has not been presented before.

In the mentioned paper, the method was to filter text on certain keywords that expressed positive or negative emotions, such as amazed, angry, lucky, excited and disappointed. Then the agent was trained on a set of manually annotated data, since it must base the sentiment association on something. Thereafter, the training set with associated sentiments was sent to a classifier that used the information to determine the probability of a certain sentiment based on the keywords a text contained. With the help of the generated statistical model, the program lastly determined the probable category for a freshly selected tweet.
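The pipeline described above (annotated examples, a trained statistical model, classification of a new text) can be sketched with a small Naive Bayes classifier written from scratch. Note the assumptions: the paper used NLTK, whereas this sketch uses only the Python standard library; the `TRAINING` tweets are invented for illustration; and add-one smoothing is one common way to handle unseen words, not necessarily the paper's choice.

```python
import math
from collections import Counter

# Hypothetical hand-labelled training tweets (the paper's data is not reproduced here).
TRAINING = [
    ("so excited and lucky today", "positive"),
    ("amazed by this great dinner", "positive"),
    ("angry about the burnt cake", "negative"),
    ("really disappointed and angry", "negative"),
]


def train(examples):
    """Count words per label and examples per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest Naive Bayes log-probability."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_examples = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        # Log prior plus the sum of log likelihoods with add-one smoothing.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(counts.values())
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


word_counts, label_counts = train(TRAINING)
```

Because every word receives a small non-zero probability, the classifier still produces a guess for texts containing words it has never seen, which is the advantage of statistical methods mentioned earlier.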

Sentiment-based agents seem suitable when the only purpose is to speak casually on a very simple level and when the goal is to get a fitting reaction based on the emotional state of the communication. For more complex conversations or task-focused dialogues, however, they do not appear appropriate. By using more keywords, and maybe by creating a more context-based dialogue with the help of expressions, the accuracy of the association could be increased, and then they could perhaps be of use in this project for the part responsible for small talk.

Chatbot trained on movie dialogue

The large set of data that a data-based conversational agent uses can be generated in several ways; it does not have to be annotated manually. In one similar project, the author used movie subtitles to train the chatbot [14]. The results were not so promising, and the responses generated by the conversational agent were more often out of context than not. This can perhaps be explained by the abrupt transitions in movie dialogues, the changes between scenes, and maybe that the database was not big enough. Another disadvantage is that the type of movie selected is crucial for how the agent turns out. The target group, the genre of the movie and other things regarding it will decide the behavior of the agent. But the idea of using movie dialogues was very interesting, since it is a possible replacement for manually generating databases and therefore increases the automation of creating conversational agents.

Conversational Agents in the Enterprise

Another study treated the subject of conversational agents used in enterprises [12]. First, the various possible areas of use for conversational agents were discussed, and thereafter more specifically how they could be applied in the enterprise sector and what difficulties could then be encountered. The main conclusion was that conversational agents can be helpful providers of information regarding companies' products and services. The study also concluded that they could act as a cost-effective substitute for, or at least complement to, customer service, guided selling, website navigation and technical support. It was believed and predicted that the use of conversational agents would increase drastically over the years. Another paper [21] even claims that the conversational interface will one day replace the need for programming by hand.

Summary

In summary, neural network-based agents, compared to those built on a database, can be applied more generally and have a good capacity for handling unknown inputs, but unfortunately tend not to fulfill their purpose if not trained well. Furthermore, a sentiment-based agent is not enough to cover the goal of this project, but can be implemented as a part of it and, if developed further, cover the casual conversation part of the program. It is also noted that conversational agents have a broad spectrum of utility, not least within the enterprise sector, and are predicted to be of ever greater importance in society, and one day even replace the need for programming by hand.

The best approach for this thesis project's purpose seems to be implementing a conversational agent that is trained with a big set of data and thereafter uses machine learning and statistics. Another conclusion that can be made is that the project is of great relevance, since the applications and use of conversational agents are expanding drastically.

2.2 Tools

In this section, information about the data, tools and libraries used for the project is presented. A library, sometimes referred to as a module, is a collection or pre-configured selection of routines, functions, and operations that a program can use. In this report it is assumed that the reader is as familiar with programming as a peer student about to complete a Bachelor's degree in Engineering Physics. Therefore the concepts of classes, objects, if statements and while loops will not be explained further.

Python

Python is the programming language chosen for this project, and there are several reasons for that. For starters, Python is easy both to learn and to understand, it has a very useful standard library with simple but efficient built-in functions and expressions, and it is supported by many systems and platforms, which makes it suitable for integrating different sources. One more benefit of Python is that, because it is so commonly used and for so many different purposes, it has a broad range of resources: support for problem-solving in the coding, SDKs for platforms, and applicable libraries, which are described more explicitly below.

Scraping the recipes, part 1: Requests

Requests is a Python module that allows users to send HTTP/1.1 requests without having to manually add query strings to URLs or form-encode POST data [13]. It allows users to add content like headers, form data, multipart files, and parameters via simple Python data structures, and to access the response data in the same way. Requests has a long list of features, but is only used in this project to access the source code of a website. Compared to its alternatives, for example urllib and urllib2, Requests requires shorter code and is considered easier to use. Requests encodes the parameters automatically, which allows the user to just pass them as simple arguments, unlike in the case of urllib, where the method urllib.urlencode() has to be used to encode the parameters before passing them. Requests is thread safe and automatically decodes the response into Unicode. In addition, Requests has more convenient error handling. If the authentication failed, urllib2 would raise a urllib2.URLError, while Requests would return a normal response object. To see if the request was successful when using Requests, the user just has to check the boolean response.ok.
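The difference between manual and automatic parameter encoding can be sketched as follows. The search URL and the parameter name `q` are assumptions made for illustration, not a documented interface of any website, and both function names are hypothetical. `urllib.parse.urlencode` is the Python 3 successor of the `urllib.urlencode` function mentioned above.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Manual approach: URL-encode the parameters and append them to base_url."""
    return base_url + "?" + urlencode(params)


def fetch_page_source(base_url, params):
    """Requests approach: return the page's HTML, or None if the request failed."""
    # Third-party package, imported here so that the helper above stays
    # usable without it. Network errors raise exceptions and are not handled.
    import requests

    # Requests encodes the params dictionary into the query string itself.
    response = requests.get(base_url, params=params, timeout=10)
    return response.text if response.ok else None


# Hypothetical search endpoint and parameter name, for illustration only.
SEARCH_URL = "https://www.bonappetit.com/search"
print(build_search_url(SEARCH_URL, {"q": "chocolate cake"}))
```

With Requests, the encoding step in `build_search_url` is not needed: the dictionary is passed directly and the status of the request is checked through `response.ok`.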


Scraping the recipes, part 2: lxml

The lxml XML toolkit is a Python module that works as a Pythonic binding for the C libraries libxml2 and libxslt [16]. A binding from Python to a library is an application programming interface (API) that provides glue code for using that library in Python. lxml combines the speed and XML feature completeness of the libxml2 and libxslt libraries with the simplicity of a native Python API. Its developers describe it as the most feature-rich library for processing both XML and HTML in Python. Other advantages of this module are that it is very fast and memory efficient, and that it works well in combination with Requests. In this project, the lxml package is used to scrape data from the source code of websites using XPath functions. An alternative to this module is Beautiful Soup, whose name comes from "tag soup" and indicates that the module is specialized in handling invalid markup. It is a beginner-friendly package that creates a parse tree that can be used to extract data from HTML, and it also automatically converts incoming documents to Unicode and outgoing documents to UTF-8. It is slower but more flexible than lxml. lxml therefore seems to be the right choice when the websites are known to have straightforward structure and markup, and the program is simple enough not to need the complexity or flexibility of Beautiful Soup.
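Extracting data with path expressions can be sketched as below. To keep the example free of third-party dependencies, it uses the standard library module `xml.etree.ElementTree`, which supports a limited subset of XPath; lxml offers an ElementTree-compatible API and adds full XPath support through its `xpath()` method. The markup in `PAGE` and the class name `ingredient` are invented for illustration and do not reflect the layout of any real recipe site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed recipe markup; real pages are larger and messier.
PAGE = """
<html><body>
  <h1 class="title">Pancakes</h1>
  <ul>
    <li class="ingredient">2 eggs</li>
    <li class="ingredient">3 dl flour</li>
    <li class="note">Serves four</li>
  </ul>
</body></html>
"""


def extract_ingredients(markup):
    """Return the text of every <li> element whose class is 'ingredient'."""
    root = ET.fromstring(markup.strip())
    # Path expression: any <li> descendant with a matching class attribute.
    return [li.text.strip() for li in root.findall(".//li[@class='ingredient']")]


def extract_title(markup):
    """Return the text of the first <h1> element with class 'title'."""
    root = ET.fromstring(markup.strip())
    return root.find(".//h1[@class='title']").text.strip()
```

This only works because the sample markup is well formed. Real web pages often are not, which is where a tolerant HTML parser such as the one in lxml or Beautiful Soup becomes necessary.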

Scraping the recipes, part 3: Recipe Scrapers

Recipe Scrapers is an open source Python module made by Hristo Harsev that works as a simple web scraping tool for a variety of recipe sites [8]. It collects the title, cooking time, ingredients and instructions of a recipe. It uses urllib and Beautiful Soup to parse and scrape websites. The benefit of using this module is that it saved valuable time. It had many of the features that were wanted for the program and could therefore be used in a simple way without modification. However, had we chosen the alternative of building a similar program ourselves, perhaps with the help of Requests and lxml, we would have had a better understanding of how the recipe scraping works and the project program would have been more consistent.

Recipe website: Bon Appetit

When looking for a website to use for the project, a few aspects were prioritized. The website should be in English, have a big and varied database of recipes, have a flexible and well-functioning search function, and preferably be on Recipe Scrapers' list of recipe sites that it can scrape. The sites supported by Recipe Scrapers, which are all in English, were therefore examined, and after some difficulty with the search function of some of the other sites, including Jamie Oliver's and BBC Food's sites, Bon Appetit was chosen [5]. It is available at https://www.bonappetit.com/ and was chosen mainly due to its surprisingly big database of recipes of all kinds and its superior, user-friendly search function that handles misspellings and conjunctions well. The Bon Appetit website originates from a food magazine with the same name published by Condé Nast. One clear disadvantage of Bon Appetit is that a few of the recipe pages have a different layout than the others, and Recipe Scrapers has a hard time scraping all of the information from them. However, this seems to be the case for most modern sites due to campaigns etc., and it would probably be a problem with most other sites as well.

Exchanging Data: json

JSON, short for JavaScript Object Notation, is a file format that is used for transferring information in this project [1]. It structures the information to be transferred into data objects built from two main structures: pairs of attributes and values, and ordered lists (arrays) of values. This means that all the information in the JSON file can be reached with the help of keywords and indices, making the stored information easily accessible.
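The two structures can be illustrated with Python's built-in json module. The JSON text below is made up for the example and is not data from the project.

```python
import json

# Made-up JSON text containing both structures described above:
# attribute-value pairs and an ordered list.
text = '{"title": "Tomato soup", "ingredients": ["tomatoes", "onion", "salt"]}'

data = json.loads(text)

title = data["title"]                       # reached by keyword
second_ingredient = data["ingredients"][1]  # reached by keyword, then index

print(title, second_ingredient)
```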

Natural Language Processing: Dialogflow

If a conversational agent is to be created, it needs a natural language processing (NLP) part inside the program. This is the part that interprets the input from the user and converts it into useful information in the form of fulfillments. It is the part that translates input to action: it ascertains what the information means and what to do with it, turning natural user requests into actionable data.

There are many different ways to implement NLP, either by programming it from scratch or by using already established databases and platforms. Since the time available for this project was limited and the main focus was not on creating the NLP, an online platform named Dialogflow was chosen. The main reason Dialogflow was chosen is that it is Google-based, and since Google's products are used so widely all over the world, in so many applications and with a good reputation, Dialogflow was presumed to be well developed. Another argument for Dialogflow is that the information delivered by its NLP can be extracted as a JSON file, which as mentioned has its advantages.

Dialogflow is a truly user-friendly instrument for the developer. It uses only two parameters for implementing the NLP: intents and entities [9]. An intent can be explained as a group of expressions that are linked to the same action, or as the parameter that categorizes the input [10]. For example, one intent in this project was ‘Recipe Search Trigger’, which was activated by phrases such as “Can you help me find a recipe?”, “Search for a recipe” and “I am hungry”.

A problem occurs if several intents have similar training phrases but completely different purposes. Since the NLP has to interpret unknown expressions, distinguishing between two intents can be hard. A logical consequence of this is that the developer has to be very precise in defining what kind of expressions should be mapped to each intent, keeping the training phrases distinct and separated. One solution offered by the Dialogflow developers is the possibility to prioritize intents, but this only works if some intents are preferred or more important.

The entity is the parameter that Dialogflow uses to sift out the useful information from the input. One entity in this project was ‘ingredients’, which is an ingredient dictionary with 1200 words and phrases. This means that every time the input contains a word or phrase that can be found in the dictionary or is likely to fit inside it, Dialogflow recognizes it as an ingredient and saves the value.
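The dictionary lookup behind an entity can be pictured with a toy sketch. This is not how Dialogflow works internally, which is not described here. It only matches exact dictionary entries and does not attempt the "likely to fit" matching mentioned above. The five-entry dictionary and the function name find_ingredients are made up for the example.

```python
import re

# Tiny stand-in for the 1200-entry ingredient dictionary.
INGREDIENTS = {"arugula", "feta cheese", "tomatoes", "beans", "rice"}


def find_ingredients(sentence):
    """Return the dictionary entries found in the sentence, in spoken order."""
    words = re.findall(r"[a-z]+", sentence.lower())
    padded = " " + " ".join(words) + " "
    hits = []
    for entry in INGREDIENTS:
        # Surrounding spaces make sure only whole words or phrases match.
        position = padded.find(" " + entry + " ")
        if position != -1:
            hits.append((position, entry))
    return [entry for _, entry in sorted(hits)]


print(find_ingredients("Can you find me a recipe with arugula and feta cheese"))
```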

Dialogflow has several prebuilt agents that can be imported and used together with the program. One that is easily activated is ‘Small talk’, which is a module for casual conversation. Of course the small talk would be of higher quality if developed manually and customized, but for this project the simple provided small talk was good enough. The small talk module gives the impression of a much more developed conversational agent that is more pleasant to interact with.

Text to speech: pyttsx3

Pyttsx3 [3] is a speech synthesis module for Python that includes drivers for text-to-speech synthesizers on OSX (NSSpeechSynthesizer), Windows (SAPI5) and Ubuntu (espeak). It can be used to register and unregister event callbacks, produce and stop speech, get and set speech engine properties, and start and stop event loops.

Speech recognition: SpeechRecognition

SpeechRecognition is a Python library for performing speech recognition, with support for several engines and APIs, online and offline, including CMU Sphinx, Google Speech Recognition, Google Cloud Speech API, Wit.ai, Microsoft Bing Voice Recognition, Houndify API, IBM Speech to Text and Snowboy Hotword Detection. It supports 120 languages, and defaults to the operating system's language if nothing else is stated.

3 System

In this chapter the program with its sections and functions is presented. The subsections of the chapter follow the sectioning of the code to give a clear view of how the program was built. Each subsection gives information about the functions in that part of the program, and about which modules have been used and how.

3.1 Overview

The main code of the program is divided into four sections: Speech interface, Natural language processing, Web search & web scraping and lastly a section for operating in recipes. The other part of the system is the online platform for natural language processing mentioned earlier - Dialogflow.

The way a user's input is handled and a response is triggered looks like this:

Figure 3: Flowchart of how the user's input is handled in the program

1. The user speaks.

2. The speech interface part of the program translates the speech to text and forwards it to the NLP part.

3. The data is sent from the NLP part to Dialogflow, where it is interpreted and a JSON file is generated.

4. The JSON file is extracted to the NLP part of the program.

5. Depending on which intent is activated in Dialogflow, the response from Dialogflow is either sent directly to the speech interface, which performs a text-to-speech translation and reads the response to the user, or the response is first sent to one of the other two parts of the program (operating in recipes, or recipe search and scrape) and from there to the speech interface.

6. The program goes into an infinite loop, which can only be broken if the quitting intent is activated in Dialogflow (not shown in flowchart).
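The steps above can be sketched as a small dispatch loop. This is a simplified illustration under stated assumptions and not the project's actual code: the function names, the intent names ("quit", "search", "small talk") and the shape of the interpreted result are all hypothetical, and the speech and NLP parts are replaced by stand-in functions so that the flow can be run without a microphone or a Dialogflow account.

```python
def handle_turn(text, interpret, handlers):
    """Process one utterance and return (reply, keep_running)."""
    result = interpret(text)          # steps 3-4: NLP result as a dictionary
    intent = result["intent"]
    if intent == "quit":              # step 6: the only way out of the loop
        return result["reply"], False
    handler = handlers.get(intent)
    if handler is None:               # step 5: reply taken straight from the NLP
        return result["reply"], True
    return handler(result), True      # step 5: routed through another program part


def run(listen, speak, interpret, handlers):
    """Steps 1-2 and 5-6: listen, interpret, respond, repeat."""
    keep_running = True
    while keep_running:
        text = listen()
        reply, keep_running = handle_turn(text, interpret, handlers)
        speak(reply)


# Stand-ins for the speech interface and Dialogflow, for illustration only.
def fake_interpret(text):
    if "bye" in text:
        return {"intent": "quit", "reply": "Goodbye"}
    if "recipe" in text:
        return {"intent": "search", "reply": "", "keywords": ["pizza"]}
    return {"intent": "small talk", "reply": "Nice to hear"}


handlers = {"search": lambda result: "Searching for " + " and ".join(result["keywords"])}

inputs = iter(["hello", "find a recipe", "bye"])
spoken = []
run(lambda: next(inputs), spoken.append, fake_interpret, handlers)
print(spoken)
```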

The flowchart in figure 4 shows how the program's functions are related to each other and how a user can navigate through the program.

Figure 4: Model of the program’s functions’ relations

When the program is started, a short greeting and introduction is read to the user. Then the user can choose to either small talk with the program or to directly start a recipe search by triggering the recipe search intent in Dialogflow. If small talk is started, the user can choose to go to searching for a recipe at any point. When a search for a recipe is initialized, a recipe must be selected or the search aborted before another function can be used. Only if a recipe is selected can the functions in the lower part of the diagram be performed. If a new search is to be performed, an existing recipe must first be canceled. This is to prevent the program from aborting a recipe as a result of a misunderstanding of the input. For example, if the user has a recipe selected and says the name of an ingredient, the program might interpret this as a request to make a new search with that ingredient, and to make sure it doesn't cancel against the user's will, it asks “Do you really want to cancel the current recipe”. At any point in the recipe part of the program, marked green in the diagram, except when selecting a recipe, the user can go back and forth to small talk.

Now follows a more in-depth explanation of the components of the system.

3.2 Dialogflow platform

Since the purpose of this conversational agent is to assist in the kitchen, the NLP had to be focused on recipe making, recipe recitation and, to some extent, casual language understanding.

3.2.1 Created intents

The intents used for recipe making are ‘Asking for meal’, ‘Trigger recipe making’ and ‘Asking for ingredients’. The first handles the case when the user expresses the will to cook a meal, but not specifically which one. It has training phrases such as ‘I want a recipe for something to cook for dinner’ or ‘I am in the mood for a snack’. The ‘Trigger recipe making’ intent covers many different ways of expressing interest in making a recipe without giving ingredients to search for, with training phrases such as ‘Can you help me find a recipe?’ and ‘I am hungry’. The last intent of the recipe making kind is the one triggered when the user asks for ingredients. The input can be ‘Can you search for pizza?’, ‘Can you find me a recipe with arugula and feta cheese’ and ‘Help me to cook something with tomatoes, beans and rice’.

The next section of intents is the one oriented towards recipe recitation. The goal of the agent is not only to find a recipe but to guide the user through it and to understand different commands for operating in it. The intents used for this are ‘Trigger specific recitation’, which is activated when the user asks for something specific regarding the recipe, for example ‘How hot should the oven be?’; ‘Trigger start recitation’, which is activated by general recitation commands such as ‘Read me the recipe’; and ‘Trigger step recitation’, which is called when the input is step focused, with training phrases such as ‘What step am I on?’, ‘What is the next step?’ and ‘What is step number 5?’.

The last section of intents was developed because the agent had trouble telling casual language apart from vital information. For example, the word ‘okay’ was categorized as an ingredient. Therefore intents for confirmation and opposition were created, as well as an intent for presentation of the program that answers questions such as ‘What can you do?’ and ‘How can you help me?’. The names of the intents of this section are ‘Yes’, ‘No’, ‘Ok’ and ‘Presentation of the program’.

3.2.2 Created entities

Like the intents, the entities of the program were chosen partly to be recipe focused and partly to sift out irrelevant information. The entities that carry important parameter values regarding the recipe are ‘ingredients’, ‘kitchen tools’, ‘time’, ‘temperature’, ‘verbs’ and ‘courses’. The entities used for maneuvering in the recipe are ‘before’, ‘now’, ‘next’ and ‘start over’, and the entities used to separate the recipe values from casual language are ‘filling words’, ‘no’ and ‘yes’.

3.3 Recipe class

To keep track of the current recipe, a Recipe class instance is created when a recipe is selected. A Recipe instance has fields for all relevant attributes of a recipe: title, cooking time, number of servings, ingredients and instructions. Title, cooking time, ingredients and instructions are all set with the help of Recipe Scrapers when the instance is created, and the number of servings is found with the help of the lxml module. This is because Recipe Scrapers did not have this function, and because the number of servings was considered an important piece of information for a user to know about the recipe.
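A class with these fields could look roughly like the sketch below. The field names, types and the choice of a dataclass are assumptions made for illustration. The thesis does not show the actual class definition, and the example values are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Recipe:
    """Holds the attributes of the currently selected recipe."""
    title: str
    cooking_time: Optional[int] = None   # assumed to be in minutes
    servings: Optional[str] = None       # scraped separately in the project
    ingredients: List[str] = field(default_factory=list)
    instructions: List[str] = field(default_factory=list)


# Made-up example values.
recipe = Recipe(
    title="Tomato soup",
    cooking_time=30,
    servings="4 servings",
    ingredients=["tomatoes", "onion"],
    instructions=["Chop the vegetables.", "Simmer for 20 minutes."],
)
print(recipe.title, len(recipe.instructions))
```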

3.4 Speech interface

The speech interface of the system has two parts, text-to-speech and speech-to-text. The text-to-speech part uses the pyttsx3 module, which accesses the text-to-speech synthesizer of the computer's operating system. On computers running OSX the synthesizer is called NSSpeechSynthesizer, and it is the one used in the project.

The speech-to-text part uses the SpeechRecognition module. Here, SpeechRecognition is used online with Google Speech Recognition, which uses Google's deep learning neural network algorithms (see footnote 1). The recording works by waiting until the user has started speaking, and then recording until it encounters a set number of seconds of silence. If no audio could be recognized, either because the set number of seconds passed without any sound or because the sound was unrecognizable, or if the request to the Google Speech Recognition service failed, the listening process repeats until it is successful.
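The repeat-until-successful behaviour can be sketched as a retry loop. The sketch is written with stand-in functions so that it runs without a microphone, and listen_until_understood, capture and transcribe are hypothetical names. With the SpeechRecognition library, capture could wrap Recognizer.listen on a Microphone source, transcribe could be Recognizer.recognize_google, and the errors to retry on could be WaitTimeoutError, UnknownValueError and RequestError, but that wiring is an assumption and is not shown in the thesis.

```python
def listen_until_understood(capture, transcribe, retry_on=(Exception,)):
    """Repeat recording and transcription until some text is returned."""
    while True:
        try:
            audio = capture()
            return transcribe(audio)
        except retry_on:
            continue  # nothing usable was heard, so listen again


class Unrecognized(Exception):
    """Stand-in for the library's 'could not understand audio' error."""


# Two failed attempts followed by a successful one.
attempts = iter([None, "noise", "find a recipe"])


def fake_capture():
    return next(attempts)


def fake_transcribe(audio):
    if audio in (None, "noise"):
        raise Unrecognized()
    return audio


text = listen_until_understood(fake_capture, fake_transcribe, retry_on=(Unrecognized,))
print(text)
```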

3.5 Natural language processing

One part of the code is named NLP and has both some functions for understanding the user and some that are more general. The conversating function is the most frequently called function in the whole code and is the connection to the NLP platform. This function is responsible for sending in the input and extracting the JSON file with all the useful information. Another NLP function, yes_or_no, is called in any situation where the program needs to confirm what the user wants.

The most important of the general functions in this section is trigger_recipe, which is called when the ‘Make recipe’ intent is activated. It has the task of finding out what kind of recipe the user wants to search for, making sure the keywords for the search are correct and thereafter sending them to the search_recipe function in the Web search & web scraping section of the code. If a recipe is freshly selected, the step object has to be reset and a few alternatives for how to proceed have to be provided, which is the purpose of another function in this section named select_or_not.

3.6 Web search & web scraping

The program has a section of code called “Web search and web scraping” that is responsible for finding recipes for keywords provided by the NLP section of the code, presenting these to the user, and getting the user to select a recipe. When searching for recipes, the program searches the Bon Appetit website for the keyword or keywords identified by Dialogflow, presents some of the found recipes to the user, and then asks the user to either choose one of the presented recipes, ask to hear more choices or quit the search.

3.7 Operate in recipe

All the functions in this section are used to operate in the recipe once one is selected. They are called inside the main function, and which one is called depends on the activated intent. If the Recipe class object is of NoneType, all these functions will return the error prompt ‘No recipe selected’.

Footnote 1: https://cloud.google.com/speech-to-text/

They are fairly straightforward: read_whole_recipe displays and recites the whole recipe when called, read_step uses the JSON file to choose and present a specific step of the recipe, find_step returns the instruction for a certain step number, and read_specific searches for a specific instruction with the help of keywords and also uses the most_common function to determine which sentences in the instructions best match the search. read_beginning is the function called when general recitation commands are made, such as ‘Read recipe’ and ‘Can we start cooking now?’, and it also calls several other functions if the input contains certain words.
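Two of these functions can be approximated as follows. This is a guess at the general approach and not the project's code: the scoring by word overlap, the use of collections.Counter and its most_common method, and every message other than ‘No recipe selected’ are assumptions made for illustration.

```python
from collections import Counter
import re

NO_RECIPE = "No recipe selected"


def find_step(instructions, number):
    """Return the instruction for a 1-based step number."""
    if instructions is None:
        return NO_RECIPE
    if 1 <= number <= len(instructions):
        return instructions[number - 1]
    return "There is no step " + str(number)


def read_specific(instructions, keywords):
    """Return the instruction that shares the most words with the keywords."""
    if instructions is None:
        return NO_RECIPE
    wanted = {word.lower() for word in keywords}
    scores = Counter()
    for sentence in instructions:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        overlap = len(wanted & words)
        if overlap:
            scores[sentence] = overlap
    if not scores:
        return "Nothing found in the instructions"
    best_sentence, _ = scores.most_common(1)[0]
    return best_sentence


# Made-up instructions for the example.
steps = [
    "Preheat the oven to 200 degrees.",
    "Chop the tomatoes.",
    "Bake the tomatoes in the oven for 20 minutes.",
]
print(find_step(steps, 2))
print(read_specific(steps, ["oven", "bake"]))
```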

4 Evaluation

In this chapter, the method and results of the evaluation are presented. The purpose of the evaluation is to get data from more sources than us developers on how well the program performs, and opinions on how useful it is.

4.1 Method

For the evaluation of the program, 10 people were asked to test it. They were not given any prior instructions except for the introductory greeting given by the agent when the program was started, saying that the user has to wait for the word “listening” to come up on the screen before talking. They were given a few minutes to try the program freely to familiarize themselves with it, and when they felt ready, they were asked to perform a number of tasks. The tasks were all commands that the program should be able to perform, but since different people have different expectations of how different features should work, and formulate sentences differently, the tests were expected to provide useful information about possible improvements of the program. The tasks were:

• Find a recipe with one key ingredient.

• Find a recipe with two key ingredients.

• Find a recipe with more than two key ingredients.

• Select a recipe, then go through the steps chronologically.

• Ask to hear all the ingredients.

• Get it to save the ingredients as a grocery list.

• Ask about the amount of servings.

• Ask about the amount of a specific ingredient.

• Ask about what to do with a specific ingredient.

• Ask about something specific in the instructions (for example oven temperature).

• Ask to hear a specific step (for example step 5).

• Cancel the recipe and search for a new one.

• Try to have a conversation with it (small talk). Tell it about how you are feeling or ask about how it is doing.

When performing the tasks, the tester was asked to answer two questions for each task on a scale of 1-5:

• Did it work as expected? 1 = No, 5 = Yes.

• Do you think it is a useful function? 1 = No, 5 = Yes.

After finishing all of the tasks, the tester was asked to answer the following questions:

• Did the program do everything you thought it would?

• Did it work the way you expected? If no, specify.

• Do you think any functions need improvements? If yes: what and how?

• Do you miss any specific functions? What?

• What do you think of how well the program understands you?

• What do you think of the speech synthesis?

• Would you use this program if it was available in the form of an app? Why/why not?

• What could be done with the program to increase your interest in using it in the future?

and to leave any other comments they had.

4.2 Results

Presented in table 1 are the average scores for how well the program's features lived up to the testers' expectations and how useful they were judged to be, both on a scale of 1-5, where 5 was the highest.

Table 1: Average scores in evaluations on a scale of 1-5.

Task                                               Expectations   Usefulness
1 ingredient                                       4.56           4.89
2 ingredients                                      4.22           5
More than two key ingredients                      4.33           4.89
Go through the steps                               4.56           4.78
Hear all the ingredients                           4.78           5
Save the ingredients                               5              4.89
Amount of servings                                 4.22           4.89
Amount of an ingredient                            4.44           5
Instruction for a specific ingredient              4.44           4.78
Ask about something specific in the instructions   3.22           4.78
Ask to hear a specific step                        4.11           4.22
Cancel the recipe and search for a new one         5              4.78
Small talk                                         3.78           3.11

As shown in the table, all features were judged useful (between 4 and 5 on a 1-5 scale) except small talk. The functionality of most features lived up to the testers' expectations. The features with the most problems were asking about something specific in the instructions, such as the oven temperature or how long to cook something, and small talk.

Features

Everyone who tried the program said that it had all of the expected features for a program like this, and one third said that the program could do even more than they expected. No one missed any features that they were expecting. However, when asked what features they would like to see added, there were a few suggestions. They were: being able to see pictures of the dish, being able to see a video of the steps/instructions, being able to restrict the search to a certain type of recipes, for example vegetarian or vegan recipes, or recipes excluding certain ingredients, and being able to choose between metric and imperial units at the start.

Feature improvements

When asked about function improvement and what could be done to increase the testers' interest in the program, most people claimed that they needed the program to be faster. They seemed to have some problems with the speech recognition and speech synthesis, and said that the program was too slow and that it had trouble understanding simple words. For example, a few testers had problems with the program interpreting “two” as “to”, and “three” as “tree”. The feedback for improvements also had to do with the NLP and the interpretation of the input. At one point, a user said “I'm feeling happy” and the program answered with “Oh no, what happened?”, as it would have if it had interpreted the input as the user feeling sad. The testers also had some issues with the program not understanding synonyms or alternative names for ingredients or cooking equipment. For example, one tester had the ingredient “coconut oil” in the ingredient list, but the instructions referred to the ingredient as “oil”, so when the tester asked what to do with the coconut oil, the response was that the ingredient could not be found in the instructions. Other testers experienced similar cases.

NLP and speech recognition

When asked about the NLP and how well the program understood the inputs, some problems were noticed. Some of them have already been mentioned in “Feature improvements”. Most testers said that the program understood them okay, but sometimes ran into trouble with pronunciation or certain sentences. Only one tester had no problems whatsoever with the NLP and speech recognition. A few testers seemed to have a problem with the program not understanding their accent, and one tester thought that maybe it prefers American pronunciation as opposed to British. Another problem that some testers had was that they could not pause in the middle of a sentence, because the program would cut them off and try to interpret only half of the sentence as input. This mostly happened when testers wanted to search for more than one ingredient and paused between the ingredients, and the program therefore didn't perceive all of them.

Speech synthesis

Not surprisingly, most testers said that the program sounded “robot-like”, and left comments like “neutral”, “not very fun” and “sounds like a computer”. One tester noticed faults in the pronunciation of some words. Another tester said that it would be good if the user could interrupt the program in the middle of reading something, maybe with some sort of verbal command, so that the user wouldn't have to wait until the program had finished reading in case they didn't want to hear any more.

Recipes

All of the testers said that the recipes from the Bon Appetit website were excellent and varied. One person had some problems when searching for more than one ingredient, as the program did not present recipes with all ingredients, but with only one or two of the three that were searched for. Some users also noticed that the program missed one or two ingredients when it said what it was searching for, but then still searched for the right thing. It would say “Do you want to search for ingredient 1, ingredient 2 and ingredient 3?”, the user would say “yes” and then it would say “Searching for recipes with ingredient 1”, but still present recipes with all ingredients. Another problem that arose during the evaluation is that sometimes the same recipe was listed two times in the same search.

Personal use

When asked if they would use the program if it was an app that they could download to their phones, testers said either yes or maybe. Two thirds said that they would use it, and two of them gave the comments “perfect to use in grocery store to get shopping list” and “It provides excellent tips for recipes”. Of those who answered maybe to this question, one said it depended on whether they were alone or not.

Small talk

When commenting on the small talk feature, most testers said that it was a fun feature, but that it was not necessary for the program. One tester said “I'm not interested in talking to my cook book”. There were also comments about the program not being able to understand enough to carry a conversation, which is reflected in the rating below 4 for how it lived up to expectations in table 1.

5 Discussion

5.1 Evaluation

The evaluation of the program provided some insight into the flaws and possible areas of improvement of the program, as well as opinions on how well conversational agents can be integrated into cooking and used as a cooking assistant.

First of all, the ratings for the usefulness of the program's features were all high except for small talk. Some of them were rated 5 on a scale of 1 to 5 by all testers, and many others were rated between 4.5 and 5. The only one below 4.5 except small talk was hearing a specific step (for example step 5). It can be hard for a user to remember what a specific step is, and asking for a step by number will therefore not be as relevant as asking about what to do with a specific ingredient or piece of cooking equipment. The ratings of how well the functions lived up to the testers' expectations varied somewhat, but were between 4 and 5 for most functions, which indicates that the testers were for the most part happy with the program's performance. Asking about something specific in the instructions got a lower rating, 3.22, which indicates that many of the testers had problems when doing that task. This can be because of problems with Dialogflow discussed in section 5.1.1, but could also be because of improper coding. To know the reason behind this, the function needs to be evaluated further and the failed attempts analyzed.

Regarding feature improvements, the main focus of most of the testers was the speed of the program. The problems with the speed seemed to be mostly related to the speech recognition and speech synthesis modules, but also emerged because of connectivity issues at some points. The connectivity issues had nothing to do with the program itself, but are still interesting, because it seemed as though the limiting factor sometimes was the upload speed of the network and not the download speed. When communicating with the Google Speech Recognition API, audio files are uploaded and analyzed by the API. If the upload speed is too low, it will take longer to send the audio files, and it will appear to the user as if the program is still listening because their words have not been written out in text yet. The appearance of the program being slow can also have to do with how the speech recognition module works. As explained in section 3.4, the recognition function starts recording when audio is heard, and stops recording when a set number of seconds of silence is detected, without including the silence in the recording. If the set number of seconds of silence is large, the recognition will appear slow. This is a fine balance, because if the number of seconds of silence is too small, the recording will be cut off when the user pauses between words, which is also something that some testers complained about. From these evaluations it can therefore not be concluded whether the number of seconds should be increased or decreased.

Another improvement area was the understanding of single words and short phrases. The Google Speech API uses machine learning, which means that in addition to trying to hear exactly what the user says, it uses statistics to determine how likely it is that a word was said. Although Google is careful about revealing too much of how their APIs work, there is reason to believe that the algorithms they use can be adjusted and tweaked to perform better in different situations. For example, when our software asks for a number input, one could argue for the possibility of increasing the likelihood of the input being a number, i.e. making it more probable that the input is “four” instead of “for”. This would eliminate some of the problems the testers had with similar cases (“two” vs “to” and “three” vs “tree”). If Google's API cannot be adjusted to accommodate this, there are other similar modules which surely can.
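A simpler alternative to adjusting the recognizer itself is to post-process its output whenever the program is expecting a number. The sketch below shows that alternative technique and not an adjustment of Google's API. The word lists and the function name as_number are made up for the example.

```python
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

# Words that the recognizer may return instead of a number word.
SOUND_ALIKES = {"to": "two", "too": "two", "tree": "three", "for": "four", "won": "one"}


def as_number(heard):
    """Interpret a recognized word as a number, or return None."""
    word = heard.strip().lower()
    if word.isdigit():
        return int(word)
    word = SOUND_ALIKES.get(word, word)
    return NUMBER_WORDS.get(word)


print(as_number("to"), as_number("tree"), as_number("for"))
```

Such a mapping should only be applied when a number is actually expected, since “to” and “for” are far more common as ordinary words.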

The problems discovered in the evaluation that were related to the NLP and the interpretation of the input have to do with Dialogflow, and could most likely be eliminated with some adjustments. The specific case here was that “I'm feeling happy” was interpreted by the agent as if the user was feeling sad. This could be because the training phrase for the “happy” input is “I'm happy” while the training phrase for the “sad” input is “I'm feeling sad”. The algorithms should know that “happy” and “sad” are the keywords and not “feeling”, but do not seem to. If more training phrases with different formulations of the intent are created, the algorithms would be better able to identify the keywords and understand the context.

The fact that the program did not identify synonyms caused some issues in the evaluation. This problem could be solved in a variety of ways, the most straightforward being either to use Dialogflow, or to define a dictionary of relevant synonyms and hard code the program to check for synonyms when searching for ingredients or instructions.
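The second option, a hard-coded synonym dictionary, could be sketched as follows. The dictionary entries and the function name find_mentions are hypothetical, and the example mirrors the “coconut oil” case from the evaluation.

```python
import re

# Hypothetical synonym dictionary: ingredient name -> alternative names.
SYNONYMS = {
    "coconut oil": ["oil"],
    "arugula": ["rocket"],
}


def find_mentions(ingredient, instructions):
    """Return the steps that mention the ingredient or one of its synonyms."""
    key = ingredient.lower()
    names = [key] + SYNONYMS.get(key, [])
    mentions = []
    for step in instructions:
        text = step.lower()
        # Word boundaries keep "oil" from matching inside "boil".
        if any(re.search(r"\b" + re.escape(name) + r"\b", text) for name in names):
            mentions.append(step)
    return mentions


# Made-up instructions for the example.
steps = ["Heat the oil in a pan.", "Boil the pasta."]
print(find_mentions("coconut oil", steps))
```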

The issues the program had with some of the testers' pronunciations come with using Google's Speech API and are hard to do anything about without switching to another module. However, since Google's Speech API is one of the most advanced of its kind, we would probably not have more luck with other modules either.

The speech synthesis of the program got some critique for sounding, not surprisingly, “robot”-like and “like a computer”. An alternative to using the computer's own speech synthesizer is to use a platform like Google's Speech API but for speech synthesis. There are many options like this available, and they are constantly developing and getting more “human”-like in their speech, but users often have to pay for them. One therefore has to weigh the value that a more “human”-like voice brings against the cost of using the platform.

The problems that arose related to the recipes and the recipe searching mostly have to do with how the Bon Appetit website works. When searching for recipes with multiple ingredients or keywords, the website will try to present recipes that have all of the keywords, but if it does not find many or any at all, it will present recipes with one or two of the keywords. The fact that recipes were sometimes listed twice in the same search is also a fault on Bon Appetit's side, but it could easily be handled by a few lines of code in the program. When the program said what it was searching for and missed one or two ingredients, it was because of an error in the response from Dialogflow. In fact, the line of code that produces that sentence could simply be removed, since the user is already told what is being searched for when asked whether the identified keywords are correct.
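The “few lines of code” for removing duplicates could look like the sketch below. It assumes that search results are held as (title, URL) pairs and that two results with the same URL are the same recipe. Both are assumptions, and the example addresses are placeholders.

```python
def remove_duplicates(results):
    """Keep the first occurrence of each recipe, preserving the order."""
    seen = set()
    unique = []
    for title, url in results:
        if url not in seen:
            seen.add(url)
            unique.append((title, url))
    return unique


# Placeholder results for the example.
results = [
    ("Tomato soup", "https://example.com/recipe/tomato-soup"),
    ("Pizza", "https://example.com/recipe/pizza"),
    ("Tomato soup", "https://example.com/recipe/tomato-soup"),
]
print(remove_duplicates(results))
```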

The testers' positive responses when asked if they would use the program if it was commercially available indicate that integrating a conversational agent into the area of cooking and recipe searching worked well with our program. Out of the testers that said they would use the program, only two gave comments, and both of the comments were about how well it worked for the recipe searches. This could be an indication of several things: that the project's conversational agent was more successfully integrated into the recipe searching area than the cooking guidance area, that those two specific users were more interested in finding recipes than having the recipe instructions read to them, or that the recipe search area is more fitting for integration of conversational agents than the cooking guidance area. To determine this, more evaluations would have to be carried out.

5.2 System limitations

Before the evaluation was started, but after the program was finished, some errors in the program were observed. These errors are addressed in this section along with some difficulties that were encountered while developing the program.

5.2.1 Recipe scraping and selecting recipes

As mentioned in section 2.2, Recipe Scrapers sometimes has trouble scraping the correct recipe information from some pages. This stems from the fact that not all recipe pages have the same layout, and that Recipe Scrapers is coded for the most common format only. At one point it was noticed that only some of the instructions of a recipe were scraped, so the source code for that particular page (https://www.bonappetit.com/recipe/blue-cheese-and-bacon-lettuce-boats) was examined along with the code of Recipe Scrapers. It was then discovered that Recipe Scrapers assumed that all of the instructions were written in the same section of the source code, and failed to retrieve all of the information when they were not.

The select recipe function assumes that the user will select a recipe by saying a number between 1 and 5; it will not accept, for example, “I want recipe number 5”. This is a shortcoming of the code and could easily be fixed. It could also be remedied by using Dialogflow to interpret the input, but since the function was coded before Dialogflow was integrated into the program, this option was not considered until later.
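One possible way to make the selection more tolerant, without involving Dialogflow, is to look for a number anywhere in the utterance. The sketch below is an illustration under stated assumptions: the function name `parse_recipe_choice`, the `NUMBER_WORDS` table and the `max_choice` parameter are hypothetical and not taken from the project code.

```python
import re

# Spoken number words that a speech recognizer might return instead of digits.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def parse_recipe_choice(utterance, max_choice=5):
    """Return the chosen recipe number (1..max_choice), or None if none is found."""
    text = utterance.lower()
    # Prefer an explicit digit sequence such as "5" in "recipe number 5".
    match = re.search(r"\b(\d+)\b", text)
    if match:
        number = int(match.group(1))
    else:
        # Otherwise fall back to the first spelled-out number word.
        words = re.findall(r"[a-z]+", text)
        number = next((NUMBER_WORDS[w] for w in words if w in NUMBER_WORDS), None)
    if number is not None and 1 <= number <= max_choice:
        return number
    return None
```

With this approach both “5” and “I want recipe number 5” yield the same choice. The matching is deliberately naive: a phrase such as “that one” would be read as the number 1, which is one reason why letting Dialogflow interpret the input may be the more robust option.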

5.2.2 Dialogflow

The main difficulty encountered when using Dialogflow was making fine distinctions and separations between different intents and entities. A problem that occurred often was that the training phrases implemented in an intent took over the small talk feature, reducing the natural flow of the conversation with the agent. For instance, one small talk phrase was ‘I need a hug’, which was instead interpreted as a recipe-starter training phrase, since it is so similar to ‘I need a recipe’.

Also, the balance between making maximum use of the program’s machine learning and still having a secure path to the correct interpretations is hard to strike. Machine learning techniques make programs “guess” what to do depending on earlier experiences or settings, in this case databases of training phrases and entities. This is of course desired, but it is not optimal for the program to guess too much. An example of this, which was also brought up earlier, was the problem with the word ‘okay’: when it was mentioned, the NLP started the process of searching for a recipe with ‘okay’. The program is designed to search for words that are ingredient-like, and in some way ‘okay’ was assumed to be one of them. One could ask how the program jumps to that conclusion, but on the other hand it is desirable that the program is able to search for ingredients it has never encountered before. One way to address this issue was to create an entity called ‘filling words’ to categorize words without any value for the purpose of the agent.
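The idea behind the ‘filling words’ entity can also be illustrated on the program side, as a safety net applied to the keywords before a search is started. The following is only a local analogue of the Dialogflow entity, not how the entity itself is implemented; the word list and the name `drop_filling_words` are assumptions made for the example.

```python
# Hypothetical list of words that carry no value for a recipe search.
FILLING_WORDS = {"okay", "ok", "yes", "please", "thanks", "um", "well"}


def drop_filling_words(keywords):
    """Return the keywords with filling words removed, preserving their order."""
    return [word for word in keywords if word.lower() not in FILLING_WORDS]
```

Because only an explicit list of words is filtered out, unknown words are still passed on, so the program keeps its ability to search for ingredients it has never encountered before.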

5.3 Improvements & comparisons

The suggestions raised in the evaluations as improvements to the program were: being able to see pictures of the dish, being able to see a film of the steps/instructions, being able to restrict the results to a certain type of recipe, for example vegetarian or vegan recipes or recipes excluding certain ingredients, and being able to choose metric or imperial units at the beginning.

Restricting the results to a certain type of recipe, for example vegetarian or vegan recipes, could easily be implemented because Bon Appetit has a similar function. Bon Appetit has filters that allow a visitor to pick if they want the recipe to be vegetarian, gluten free, healthy or vegan. It also has filters for selecting what meal & course one wants, for example dinner, breakfast, snack or dessert. Another filter allows a visitor to tick ingredients that must be in the recipe. There is no built-in filter in Bon Appetit that allows a visitor to “block” or exclude certain ingredients, but this could be integrated into the program in code, for example by scraping the recipes before listing them, checking their ingredients and then excluding those with the undesired ingredient.
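The exclusion step described above could look roughly like the sketch below. It is an illustration under assumptions, not part of the project: the function name `exclude_recipes`, the `(title, url)` result format and the `get_ingredients` callable (standing in for whatever scraping routine returns the ingredient lines of a recipe page) are all hypothetical.

```python
def exclude_recipes(recipes, unwanted, get_ingredients):
    """Keep only recipes whose ingredient lines mention none of the unwanted items.

    recipes:         list of (title, url) tuples (assumed format)
    unwanted:        ingredient names to exclude, e.g. ["bacon"]
    get_ingredients: callable mapping a URL to a list of ingredient strings
                     (hypothetical stand-in for the scraping step)
    """
    unwanted = [item.lower() for item in unwanted]
    kept = []
    for title, url in recipes:
        ingredient_lines = [line.lower() for line in get_ingredients(url)]
        # A recipe is blocked if any unwanted item occurs in any ingredient line.
        blocked = any(item in line for item in unwanted for line in ingredient_lines)
        if not blocked:
            kept.append((title, url))
    return kept
```

Note that simple substring matching is a design shortcut: excluding “egg” would also exclude recipes containing “eggplant”. Scraping every candidate recipe before listing also adds one page request per recipe, which may noticeably slow down the search.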

The first two suggestions require a visual interface for showing pictures, texts and videos. As discussed in section 2.1.1, combining a visual interface with a conversational one comes with advantages like the ones suggested, and should be considered a relevant possible future improvement of the program.

To get a more extensive picture of how conversational agent technology can be integrated into the area of cooking and recipe searching, the evaluation could be extended. More testers with different backgrounds and interests would provide a better and wider base of results. The evaluations could also be carried out in the kitchen, allowing the tester to try the program in a “live” scenario, leading to more relevant results.

5.4 Ethics and sustainability

It is well known that it is more ecologically sustainable to eat a diet without animal products, but there are still ongoing discussions about how necessary animal products are for human beings. The project program could encourage users to eat a more vegetarian or vegan diet, and maybe even only show such recipes, but could then risk being perceived negatively by users who consider animal products an essential part of the human diet. This constitutes a conflict of interest in the sense that, on the one hand, we want to emphasize and encourage sustainable development, while on the other hand, we want to create a relevant service that has the potential to reach as many users as possible.

As conversational agent technology improves, the usage of the agents increases, and so does the difficulty of discovering when they are used for destructive purposes. Spambots are examples of chatbots that automatically send out unsolicited messages through different channels. The purposes of this can be advertisement, increasing a website’s search engine ranking, or more destructive purposes like tricking users in some way. When it becomes harder to discover this, it also becomes easier for people to use this technology to scam people.

Another ethics-related fear among people is that computers, machines and artificial intelligences will deprive people of their jobs. Technology development has effects in many areas, and especially in the area of work. At an individual level, a person’s job could be replaced by technology, but people are skilled at adapting and usually find new tasks and opportunities. For an engineer, it can be as simple as going from doing the calculations oneself one day to programming the technique that makes the calculations the next. In other situations, and especially for less educated people, it can be significantly more complicated. It can be difficult to get further or new education if you are not able to afford or access it. There is usually a solution: one can change industry completely, move to another geographical area and so on, but for some people it just doesn’t work out. This has made it more important than ever to think ahead in the choices we make in life and ask: “Will my education be relevant in 10, 25, 50 years? Will the profession I’m aspiring to still be a profession in the future, or will it be replaced by technical solutions?” The choices of education and job, and even the foundations of the educations themselves, are adapted to what is needed at the time and what is expected to be needed in the future. From a broader perspective, looking at an entire society and considering the situation over a long period of time, the question becomes simpler. One can then look at data in the area and conclude that jobs do not disappear but are usually created by technology development. Of course, what the jobs are will change; for example, when a job within service is lost, maybe two jobs within IT are created. In summary, it can be said that from a short-term individual perspective technology can be problematic for people’s job situation, but from a longer-term societal perspective it is probably positive for the job market.


6 Conclusion

The goal of this thesis project was to evaluate to what extent a conversational agent is useful in the kitchen. Even though the evaluation was restricted to a small number of testers, it nevertheless provided information of great use for this project. It turns out that a conversational agent implemented with the help of Google’s API received positive reactions from the testers, and that they believed they would use it if improvements were made.

It is concluded that a conversational agent like the one created in this project can definitely be integrated into and be of use in the kitchen. The greater part of the testers said they would use it if it were commercially available. If improvements based on the observed errors and the feedback of the testers were made, this is believed to further increase the practicality of the program and the satisfaction of the user.


