
IT 16 047

Degree project, 30 credits
August 2016

Using social media and machine learning to predict financial performance of a company

Sepehr Forouzani

Master Programme in Computer Science

Faculty of Science and Technology, UTH unit. Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, House 4, Floor 0. Postal address: Box 536, 751 21 Uppsala. Telephone: 018 – 471 30 03. Fax: 018 – 471 30 00. Website: http://www.teknat.uu.se/student

Abstract

Using social media and machine learning to predict financial performance of a company

Sepehr Forouzani

Social media have recently become one of the most popular forms of communication for a large number of people. The text and posts shared on social media are widely used by researchers, who analyze and study them and relate them to various fields. In this master thesis, sentiment analysis has been performed on posts containing information about two companies that are shared on Twitter, and machine learning algorithms have been used to predict the financial performance of these companies.

UPTEC 16 047
Examiner: Edith Ngai
Subject reader: Micheal Ashcroft
Supervisor: Lisa Kaati

Contents

1 Introduction
    1.1 Objectives
    1.2 Method
2 Related Work
3 Background theory
    3.1 Social media
    3.2 Sentiment analysis
    3.3 Financial performance
    3.4 Data collection
        3.4.1 Feature Vectors
    3.5 Machine learning
        3.5.1 Classification Algorithms
        3.5.2 Data balancing
        3.5.3 Feature selection
4 Implementation
    4.1 Financial Performance Predictor design
    4.2 Financial Performance Predictor Implementation
        4.2.1 Collecting data
        4.2.2 Feature vectors creation
5 Experiments and Results
    5.1 Dataset
    5.2 Quarterly reports
    5.3 Dictionaries
    5.4 Weka
    5.5 Experiments
        5.5.1 Experiments with the regular dictionary
        5.5.2 Experiments using the financial dictionary
6 Discussion
7 Conclusion
8 Future work

List of Figures

1 The methodology
2 Sentiment Analysis methods [18]
3 Machine learning workflow [34]
4 Steps toward financial prediction
5 The format of a feature vector.

List of Tables

1 The datasets used in the experiments.
2 The companies' performance based on the ROA.
3 The two different dictionaries and some example words.
4 Confusion matrix
5 The results for experiment 1 using TWBMW dataset.
6 The results for experiment 2 using TWBMW dataset.
7 The results for experiment 3 using TWBMW dataset.
8 The results for experiment 3 using TWVW dataset.
9 The results for experiment 4 using TWBMW dataset.
10 The results for experiment 4 using TWVW dataset.

1 Introduction

Media, and social media in particular, are now regarded by researchers as a major data source, because large numbers of people use them to communicate and to share ideas, feelings, knowledge, and personal opinions about various topics at any time. During the last ten years, Twitter and Facebook have emerged as the most popular social networking websites. Facebook has 1.59 billion monthly users and Twitter has 332 million active users [6].

Data from social media gives social scientists, economists, and statisticians a unique opportunity to understand individuals and human behavioral patterns that affect areas such as finance [4]. For example, recent research on financial performance prediction using opinion and sentiment analysis of posts shared on social media indicates that it may be possible to predict a company's stock value [5].

The data available on social media is enormous and unstructured, and it contains a lot of irrelevant information. It is therefore impossible for individuals to read and analyze all of the data manually. To make the best use of data from social media, statistical and data mining techniques need to be applied [7].

Customers' opinions about products and services are a concern for most large and medium-sized companies. Social media is one of the most widely used sources of data about customers' opinions of a particular company [8]. Most companies use various methods and techniques to find out what customers think about their services and products. However, relating customer-opinion data extracted from social media to correlated aspects of a company, such as productivity, profitability, financial performance and economics, is not always possible [21]. For example, if a firm improves productivity by downsizing, profitability might be endangered if customer satisfaction depends on the company's services [23]. Research [1] has shown that there is a relation between opinion and sentiment about a company and its stock price. However, to the best of our knowledge, there are no studies that focus on the relation between the sentiment of tweets and the financial performance of companies.

In this master thesis we investigate the correlation between the sentiment of tweets in which a certain company is mentioned in a hashtag and the financial performance of that company.

1.1 Objectives

The overall objective of this thesis project is to investigate the relation between sentiment extracted from social media and the financial performance of automotive companies. The goal is to predict the financial performance of a company based on what people write about the company on Twitter. This results in the following more specific objectives:

• Develop techniques for sentiment analysis of data from Twitter with respect to a specific company.

• Use machine learning to train a model that predicts the financial performance of a company.

• Develop a prototype tool for the proposed method.

1.2 Method

The work in this thesis is carried out in five steps, as illustrated in Figure 1.

Figure 1: The methodology

In the first step, the problem and the objectives of the research are defined. In the second step, a literature review is done. The literature study focuses on reviewing related work and on gaining knowledge about the techniques that are used in the project.

In the third step, the experiment setups and configurations are designed and data is collected.

In the fourth step, a prototype tool is developed in order to collect, prepare and analyze data. The analysis is based on mood and sentiment word lists. For the machine learning components of the project, the Weka data mining tool [39] is used. In the fifth step, the results are evaluated by measuring the accuracy of the performance prediction.

2 Related Work

This chapter reviews work related to sentiment analysis methods and to financial prediction using mood and sentiment analysis.

In [1] the authors collect public tweets posted by approximately 2.7 million users. Each tweet has an identifier, a publishing time, a submission type and a text of at most 140 characters. To make the data suitable for analysis, stop-words (topic-independent words that are the most common in a language) and punctuation are removed, and the text is then filtered for expressions such as "I feel", "i am feeling", "I'm", "Im", "I am", and "makes me", because these expressions indicate the author's mood state. In the next stage they use the OpinionFinder (OF) tool [13] for sentiment analysis. To measure the polarity of a sentence as positive or negative, OF takes a text (e.g. a large number of tweets) and uses the OF lexicon to determine the proportion of positive versus negative sentiment in the text. To measure the mood of a text they use an algorithm called Google-Profile of Mood States (GPOMS). GPOMS measures the mood of a text along six different dimensions: calm, alert, sure, vital, kind, and happy.

To normalize the time series and make the OF and GPOMS results comparable, the authors of [1] use the z-score, a statistical measure based on a local mean and standard deviation. They also use the econometric technique of Granger causality analysis [19] to investigate the relation between public mood and changes in the closing value of the stock market. The Granger causality analysis indicates that there is a predictive relation between certain mood categories and the closing price of the stock market.

In [3] the authors used machine learning and social media to predict how successful a movie will be. To measure the success of a movie the authors used return on investment (ROI), which is a profitability metric, and they applied binary and multi-class classification algorithms such as support vector machines (SVM), multilayer perceptron (MLP), decision trees (J48), random forest and the LogitBoost algorithm to predict the success. The results show that random forest was the best classifier, with an accuracy of almost 84%.

In [12] the authors investigate the possibility of predicting market sales of electronic devices using social media. They analyze the sentiment of Twitter comments about a certain product before the product is released, using semi-supervised recursive autoencoders to predict the sentiment distribution. A semi-supervised recursive autoencoder is an artificial neural network whose goal is to learn an encoding of a set of data, typically for the purpose of dimensionality reduction. In sentiment analysis, semi-supervised recursive autoencoders are used to learn semantic vector representations of phrases [20]. After running the sentiment analysis, the total number of comments, the number of positive comments, the total number of re-tweeted comments and the number of re-tweeted positive comments are extracted and used as features in their model. In the experiments, their model reached 35% accuracy in predicting iPad3 sales, while linear regression reached 58% accuracy on the same prediction, which is low and could not be used as a practical model.

In [2] the authors use Artificial Neural Networks (ANN), Support Vector Machines (SVM) and Relevance Vector Machines (RVM) to predict daily returns for an FX carry basket. A currency basket is a portfolio of selected currencies with different weightings, and an FX carry basket, made up of a long position in high-yielding currencies against a short position in low-yielding ones, is a common asset for fund managers and speculative traders. It was found that, in general, the committee of networks was much more effective at predicting five-day returns than one-day returns, and it was on this basis that the optimal configuration was used.

In [9] it is stated that the word lists generally used to measure the sentiment of a text are not accurate enough for measuring the sentiment of finance-related texts. To illustrate this, the authors of [9] reviewed the negative words extracted from 10-K reports (annual reports that contain a summary of a company's financial performance [15]) based on the Harvard dictionary [14] and found that almost seventy-five percent of the words counted as negative are not negative in a financial context. They therefore developed a new word dictionary that reflects the tone of financial texts with higher accuracy. The authors used a bag-of-words approach (treating a text as a bag of its words, regardless of grammar and word order) to produce vectors of words and word counts, and modified one of the most common term-weighting schemes to adjust for document length.

In [10] the authors develop an automated method for sentiment classification. They use a classifier based on multinomial Naive Bayes to determine whether the sentiment of a document is positive, negative or neutral. They also propose a technique that can be used to determine the sentiment of documents in any language. In their method, TreeTagger [16] (a language-independent part-of-speech tagger) is used for part-of-speech tagging, and the differences in the distribution of tags between positive, negative and neutral texts are observed. For feature extraction they use N-grams as binary features together with the frequency of keywords. Unigrams, bigrams, and trigrams are used in the experiments, and the authors state that performance is best when bigrams are used.

In [11] four classes of mood are used (calm, happy, alert and kind), and a text is categorized into these four classes using an analysis tool. The tool uses a word list based on the Profile of Mood States (POMS) questionnaire [17], where the different POMS states are mapped onto the four mood states using static correlation rules. The authors also filtered a set of tweets down to emotion-specific texts using expressions such as "feel", "makes me", "I'm" and "I am". They train the model with a new cross validation method called k-fold sequential cross validation, and the model showed 75.56% accuracy in predicting stock market movements. They tried four different learning algorithms to learn and study the correlation between mood and market: linear regression, logistic regression, support vector machines (SVMs), and self-organizing fuzzy neural networks (SOFNN). The conclusion is that SOFNN performed better than the other algorithms.

3 Background theory

3.1 Social media

The tools and platforms that enable users to interact and exchange information in different forms, such as text, pictures and video, are called social media [24]. There are a number of different types of social media, for example blogs, discussion boards and networking platforms such as Facebook and Twitter. Twitter is one of the most popular social media services. It enables users to publish and share texts of at most 140 characters, called tweets, and to use hashtags ("#") to relate their tweets to a specific topic, person or company. Several companies and business strategists consider social media an important arena, and they are constantly looking for ways to increase their profitability using social media [25].

3.2 Sentiment analysis

Sentiment analysis uses natural language processing and information extraction with the goal of classifying the writer's feeling as positive, negative or neutral [27]. Sentiment analysis is often used as a component of opinion mining, where the goal is to analyze sentiment and attitudes [28]. There are a number of methods that can be used to classify the sentiment of a text. A list of methods is shown in Figure 2.

Figure 2: Sentiment Analysis methods [18]

In this thesis the dictionary-based approach is used for sentiment analysis.

3.3 Financial performance

Financial analysts and investors most often focus on return on equity (ROE) as the primary metric for measuring a company's performance. Many executives focus heavily on this metric as well, believing that it is the one that gets the most attention from the investor community. ROE is calculated by dividing the net income by the shareholders' equity.

$$\text{Return on Equity} = \frac{\text{Net Income}}{\text{Shareholders' Equity}} \tag{1}$$

Shareholders' equity is the equity of a company as divided among the individual shareholders of the company's stock [48]. Using ROE as a performance metric has some shortcomings. For example, companies can artificially maintain a good ROE by increasing debt leverage and through stock buybacks funded from accumulated cash. Therefore other metrics, such as return on assets (ROA), can be used instead of ROE. ROA directly considers the assets that are used to support business activities, and it determines whether a company is able to generate a sufficient return on these assets rather than simply showing a robust return on sales [29]. ROA is an indicator of a company's profitability relative to its total assets [31]. It captures the fundamentals of a company's performance in a general way by looking at both income statement performance and the assets required to run the business [22]. ROA is therefore a good metric for measuring how well a company generates income from its assets. ROA is calculated by dividing a company's earnings by its total assets and is displayed as a percentage. Sometimes ROA is referred to as "return on investment". ROA is calculated using the formula below:

$$\text{Return on Assets} = \frac{\text{Net Income}}{\text{Total Assets}} \tag{2}$$
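As a worked illustration of equations (1) and (2), the following minimal Python sketch computes both ratios. The function names and all figures are hypothetical and were chosen only to make the arithmetic easy to follow; they are not taken from any company report.

```python
def return_on_equity(net_income: float, shareholders_equity: float) -> float:
    """ROE as in equation (1): net income divided by shareholders' equity."""
    return net_income / shareholders_equity


def return_on_assets(net_income: float, total_assets: float) -> float:
    """ROA as in equation (2): net income divided by total assets."""
    return net_income / total_assets


# Hypothetical figures (same currency unit), for illustration only.
net_income = 50.0
shareholders_equity = 400.0
total_assets = 1000.0

roe = return_on_equity(net_income, shareholders_equity)
roa = return_on_assets(net_income, total_assets)
print(f"ROE = {roe:.1%}, ROA = {roa:.1%}")  # ROE = 12.5%, ROA = 5.0%
```

With the same net income, the ROE is higher than the ROA here because equity is only part of what finances the assets; this is the leverage effect described in the text.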


3.4 Data collection

Data collection and dataset creation is the first step in creating a statistical model using machine learning. The dataset is commonly divided into three subsets: a training set, a validation set and a test set. The training set is used to train the statistical model, the validation set is used to estimate how well the model has been trained, and the test set is used to measure the performance of the model.
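The three-way split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_dataset`, the 60/20/20 proportions and the fixed seed are hypothetical choices, not values taken from this thesis.

```python
import random


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=1):
    """Shuffle the rows and cut them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything that is left over
    return train, validation, test


train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))  # 6 2 2
```

Shuffling before cutting avoids a split that simply follows the order in which the data was collected. For time-ordered data a chronological split may be preferable, which is a design decision outside the scope of this sketch.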

3.4.1 Feature Vectors

A feature vector is the way an object is represented in machine learning and pattern recognition. Feature vectors are n-dimensional vectors, where each vector represents one object. A numeric representation of the features (variables) facilitates statistical analysis, and many machine learning algorithms therefore require numerical features.

3.5 Machine learning

Machine learning is a field of computer science that studies ways of making algorithms find patterns or learn how to perform certain tasks. In this thesis machine learning is used to predict the performance of a company. Figure 3 shows the workflow of the machine learning process used in this thesis.

Figure 3: Machine learning workflow [34]

In the first step (data ingestion) the data is collected and stored in a database. The data is then cleaned and/or transformed, and divided into two sets: a training set and a testing set. In the next step a mathematical model is built from the training set, and the model is then tested against the testing set.

Once the model has produced results, the user can try to improve them by creating or choosing different data and feature vectors (data presentation style).

Machine learning can be divided into three categories according to the nature of the learning:

• Supervised Learning: In supervised learning the computer receives a set of inputs and their related outputs from a teacher. The goal is to find a general model that maps inputs to outputs.

• Unsupervised Learning: In unsupervised learning, the computer finds structures in the input data without any input from a teacher.

• Reinforcement Learning: In reinforcement learning the computer interacts with an environment to achieve a goal without any help from a teacher.

3.5.1 Classification Algorithms

The task of a classification algorithm is to assign new observations to the correct category from a set of known categories. The classifier estimates the category of new data based on model parameters that are learned from the training data. Different classification algorithms use different methods and variables, so several algorithms can be applied to the data in order to find the most suitable and efficient one [30]. In this section, the classification algorithms used in the project are reviewed.

Random Forest [35] consists of bagged decision trees that combine bootstrap sampling of the data with a form of attribute bagging. A decision tree is a directed series of decisions, based on the values of the input variables, that culminates in a classification of the target variable. Bagging is a method of combining multiple predictors: each predictor is trained on a bootstrap sample drawn from the training set. A bootstrap sample is a sample drawn with replacement from the training data. Random forests provide a simple means of analyzing feature importance; the resulting score is known as the variable importance score. With a random forest it is not necessary to set aside a separate test set to get an unbiased estimate of the error, since each tree is built from a different bootstrap sample of the original data, and the instances left out of that sample can be used to test the tree. Bagging (bootstrap aggregating) is designed to improve the stability and accuracy of machine learning algorithms.
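The bootstrap sampling behind bagging can be illustrated with a short Python sketch. It only shows how the rows for each tree could be drawn and which rows are left out ("out-of-bag"); it does not build any trees, and the function name and dataset size are hypothetical.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement; also return the unused rows."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(1)  # fixed seed so the sketch is repeatable
for tree in range(3):
    in_bag, out_of_bag = bootstrap_sample(10, rng)
    print(f"tree {tree}: in-bag {sorted(in_bag)}, out-of-bag {out_of_bag}")
```

Because sampling is done with replacement, some rows appear several times in a sample and others not at all. The rows that a tree never saw are the ones that can serve as its test data.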

Naive Bayes [33] is a probabilistic classifier that uses Bayes' theorem with the assumption that the features are independent (the occurrence of one feature does not affect the probability of the others). Naive Bayes computes the probability p(c|x) that an instance, represented by a feature vector x = (x1, ..., xn), belongs to the class c. Using Bayes' theorem, this conditional probability can be written as:

$$p(c \mid x) = \frac{p(c)\, p(x \mid c)}{p(x)} \tag{3}$$

Naive Bayes is useful when the time needed to train the model is important.
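To make equation (3) concrete, the sketch below evaluates it for two classes under the independence assumption, where p(x|c) is the product of the per-feature probabilities p(xi|c). All probabilities and class names are hypothetical numbers chosen for illustration; in practice they would be estimated from the training data.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods):
    """Return p(c|x) for every class c.

    priors:      {class: p(c)}
    likelihoods: {class: [p(x_1|c), ..., p(x_n|c)]} for the observed features
    """
    # Numerator of equation (3), with p(x|c) factorised by independence.
    joint = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    evidence = sum(joint.values())  # p(x), the denominator of equation (3)
    return {c: joint[c] / evidence for c in joint}


# Hypothetical values for two classes and two observed features.
priors = {"over": 0.5, "under": 0.5}
likelihoods = {"over": [0.8, 0.5], "under": [0.2, 0.5]}

posterior = naive_bayes_posterior(priors, likelihoods)
print(posterior)  # "over" gets about 0.8, "under" about 0.2
```

The denominator p(x) is the same for every class, so it only rescales the result; the class with the largest numerator is the predicted class.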

AdaBoost [32] stands for adaptive boosting. It assumes that finding many weak models is easier than finding one accurate model. Boosting is an approach for creating highly accurate prediction rules by combining weak models and rules that individually have low prediction accuracy. It generates a sequence of base models and then decides on a final estimate of the target variable by aggregating the estimates of the base models. AdaBoost follows this scheme, generating a number of weak classifiers as its base models. Like the random forest algorithm, AdaBoost provides an estimate of variable importance, but in a different way: the more informative variables are used more often, and the less informative variables are barely used.
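The text above does not spell out how AdaBoost builds its sequence of base models. As a hedged illustration, the sketch below shows one re-weighting round of the standard discrete AdaBoost formulation, in which misclassified instances receive more weight before the next weak classifier is trained. The function name and the toy input are hypothetical, and this is not the Weka implementation used in the experiments.

```python
from math import exp, log


def adaboost_round(weights, correct):
    """One AdaBoost re-weighting step.

    weights: current instance weights, summing to 1
    correct: booleans, True where the weak classifier predicted correctly
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * log((1 - error) / error)  # vote of this weak classifier
    updated = [w * exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]  # renormalise to sum to 1


# Four equally weighted instances; the weak classifier gets the last one wrong.
alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False])
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update, the single misclassified instance carries half of the total weight, so the next weak classifier is pushed to get it right. The final prediction aggregates the weak classifiers, each weighted by its alpha.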

Cross validation [42] partitions the original data into a training set and a test set in order to train and evaluate the model. In k-fold cross validation the original data is divided into k subsamples. One subsample is selected as the test set and the remaining k − 1 subsamples are used as the training set for the model. The process is repeated k times (folds), so that each subsample is used exactly once as the test set, and the results are then averaged or otherwise combined into a single estimate.
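The partitioning used in k-fold cross validation can be sketched as follows. The sketch only generates the row indices for each fold; training and evaluating a model on them is left out. The function name and the way rows are assigned to folds are illustrative assumptions.

```python
def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists for each of the k folds."""
    # Row i goes to fold i mod k, so fold sizes differ by at most one.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(n_rows=10, k=5):
    print("test:", test, "train:", train)
```

Every row ends up in the test set of exactly one fold and in the training set of the other k − 1 folds. In practice the rows are usually shuffled first, which is omitted here for brevity.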


3.5.2 Data balancing

If the numbers of instances in the classes of a dataset differ greatly, the dataset is called imbalanced. To counter the issues caused by imbalanced data, methods such as over-sampling (creating new samples of a certain class) and under-sampling (removing instances of a class) have been proposed. Synthetic Minority Oversampling TEchnique (SMOTE) [36] is an over-sampling algorithm that creates additional instances of the class with fewer instances, and it can be combined with under-sampling of the class with more instances. In SMOTE, depending on the amount of over-sampling required, neighbors are chosen from the K nearest neighbors of a data point, and a synthetic sample is then created through the following steps:

• Take the difference between the feature vector of a data instance and that of its nearest neighbor,

• Multiply this difference by a random value between 0 and 1,

• Add the result to the feature vector under consideration.
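The three steps above can be written as a short Python sketch. It assumes that the neighbor has already been chosen; the nearest-neighbor search itself is left out. The function name and the example vectors are hypothetical.

```python
import random


def smote_sample(instance, neighbor, rng):
    """Create one synthetic point on the line between instance and neighbor."""
    gap = rng.random()  # random value in [0, 1)
    # difference to the neighbor, scaled by gap, added to the instance
    return [x + gap * (n - x) for x, n in zip(instance, neighbor)]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
instance = [1.0, 4.0]   # hypothetical minority-class feature vector
neighbor = [3.0, 2.0]   # one of its nearest minority-class neighbors
synthetic = smote_sample(instance, neighbor, rng)
print(synthetic)
```

The synthetic point always lies on the line segment between the two original points, so the new instances stay inside the region already occupied by the minority class.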

3.5.3 Feature selection

The process of selecting a subset of features to be used to construct the model is called feature selection. In machine learning and statistics, the process is also called variable selection. There are various ways to do feature selection. For example, information gain (IG) ranks features by importance according to the formula:

$$IG(T, a) = H(T) - H(T \mid a) \tag{4}$$

where:

T is the set of training examples,

a is the index of a feature,

H() is the entropy function (entropy is a measure of the randomness of a variable; it measures the level of impurity in a group of examples).
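Equation (4) can be evaluated directly for small examples. The sketch below computes the entropy of the class labels and the information gain of a discrete feature; the function names and the toy data are hypothetical, and base-2 logarithms are an assumption.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(T): entropy of the class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(T, a) = H(T) - H(T|a) for one discrete feature a."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


labels = [1, 1, 0, 0]
perfect_feature = [1, 1, 0, 0]  # separates the two classes completely
useless_feature = [1, 0, 1, 0]  # carries no information about the class
print(information_gain(perfect_feature, labels))  # 1.0
print(information_gain(useless_feature, labels))  # 0.0
```

A feature that splits the examples into pure groups removes all uncertainty and therefore has the highest possible gain, while a feature unrelated to the class has a gain of zero.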

4 Implementation

In this chapter the design and implementation of the financial performance predictor (FPP) are described.

4.1 Financial Performance Predictor design

The financial performance predictor (FPP) is a prototype tool for predicting the financial performance of companies using machine learning. The workflow of FPP is shown in Figure 4.

Figure 4: Steps toward financial prediction

The first step is to collect relevant data; in this thesis we use data from Twitter. To detect the sentiment of a tweet or a group of tweets, we use the bag-of-words method. The bag-of-words method focuses on the words, or in some cases sets of words (strings of words), regardless of the context of the sentence. We use a list of words from a dictionary in which every word is associated with a sentiment: each word is either positive or negative. In the experiments we have used two different dictionaries, one developed for financial purposes and one more general. The second step is to count the number of occurrences, in the extracted tweets, of each word present in the dictionaries. The result is combined with the ROA for the corresponding time period and included in the feature vectors. In the fourth step, machine learning algorithms are applied to the feature vectors to train a model that predicts whether the ROA increases or decreases based on the sentiment of the tweets. The classification algorithms that we have used to train the model are Random Forest, Naive Bayes and AdaBoost.

4.2 Financial Performance Predictor Implementation

Various programming languages and tools are used in the implementation of

the FPP.

4.2.1 Collecting data

To download tweets, a web scraper is written in the Python programming language. In the first step, a web search query is made using a Python library called Selenium [49]. In the second step, the HTML content of the result page is made available through the page source of the web browser driver.

In the third step, a Python library called Beautiful Soup [41] is used to organize and extract the required data from the HTML source.

In the last step, the tweets are saved as a comma-separated values (CSV) file and then stored in a MySQL database to ease data management.

4.2.2 Feature vectors creation

In this thesis a program for creating feature vectors is written in Java. The program uses the word dictionaries and counts the number of occurrences of each dictionary word in the tweets. The result is stored in a vector. The format of a feature vector is shown in Figure 5.

Figure 5: The format of a feature vector.

The class variable is the company's performance. The value of the class variable is 1 in case of over-performance and 0 in case of under-performance.
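The counting step can be sketched as follows. The thesis program is written in Java; this illustration uses Python for brevity and is not the original code. The function name, the tokenization rule, the three-word dictionary and the example tweets are all hypothetical, and the class label is assumed to be appended as the last element of the vector.

```python
import re
from collections import Counter


def build_feature_vector(tweets, dictionary_words, label):
    """Count each dictionary word over all tweets and append the class label."""
    counts = Counter()
    for tweet in tweets:
        # Assumed tokenization: lower-case runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", tweet.lower()))
    return [counts[word] for word in dictionary_words] + [label]


dictionary_words = ["good", "hate", "happy"]           # hypothetical dictionary
tweets = ["Good car, good price", "I hate the wait"]   # hypothetical tweets
vector = build_feature_vector(tweets, dictionary_words, label=1)
print(vector)  # [2, 1, 0, 1]
```

Each position in the vector corresponds to one dictionary word, so vectors built from different time periods are directly comparable.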

5 Experiments and Results

In this section the experimental setup along with the results are described.

The results are further analyzed in Section 6.

5.1 Dataset

Two datasets are used for the experiments. The first dataset, denoted TWBMW, contains tweets where BMW is either mentioned or used in a hashtag (#BMW). The second dataset, denoted TWVW, contains tweets where Volkswagen is either mentioned or used in a hashtag (#Volkswagen). The two datasets are described in Table 1.

Table 1: The datasets used in the experiments.

Dataset  Description                   Size    Time period
TWBMW    Tweets related to BMW         677596  2007-2015
TWVW     Tweets related to Volkswagen  151648  2012-2015

An example of a negative tweet from TWBMW is:

”BMW is ruining the M-division brand by releasing crap like the ”X6 M”

- http://tinyurl.com/cb2nq7”


An example of a positive tweet from the same dataset is:

”Track drive reveals excellent balance of the 2015 BMW 228i - Torque

News http://bit.ly/1xk4xj7 - #BMW”

An example of a neutral tweet (neither positive nor negative) from the same dataset is:

”mclaren should come back later in the race when ferrari and bmw have

to use the hard tyres hopefully, anyway”

The sentiment of each tweet is determined by counting the occurrences of positive and negative words. If a tweet contains more positive words than negative words, the sentiment is considered positive; if it contains more negative words than positive words, the sentiment is considered negative. If a tweet contains the same number of positive and negative words, the sentiment is considered neutral.
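The counting rule can be sketched in Python as follows. The two word sets are small hypothetical stand-ins for the dictionaries, the tokenization rule is an assumption, and the example texts are shortened versions of the tweets quoted above.

```python
import re


def tweet_sentiment(tweet, positive_words, negative_words):
    """Label a tweet by comparing its counts of positive and negative words."""
    tokens = re.findall(r"[a-z']+", tweet.lower())  # assumed tokenization
    positive = sum(token in positive_words for token in tokens)
    negative = sum(token in negative_words for token in tokens)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


# Hypothetical word lists, for illustration only.
positive_words = {"excellent", "good", "happy"}
negative_words = {"ruining", "crap", "hate"}

examples = [
    "BMW is ruining the M-division brand by releasing crap like the X6 M",
    "Track drive reveals excellent balance of the 2015 BMW 228i",
    "mclaren should come back later in the race",
]
for text in examples:
    print(tweet_sentiment(text, positive_words, negative_words))
# negative, positive, neutral
```

A tweet without any dictionary words has zero positive and zero negative hits and is therefore labeled neutral, the same as a tweet whose hits cancel out.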

5.2 Quarterly reports

To obtain the return on assets (ROA) for each quarter, BMW's quarterly reports (10-Q reports) are downloaded from [44] and Volkswagen's quarterly reports are downloaded from [45]. The ROA is not explicitly stated in the quarterly reports, and it is therefore calculated manually from the total income and the total assets. Table 2 shows the performance of BMW and Volkswagen in different quarters of each year.
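The performance labels in Table 2 can be derived from the quarterly ROA values. The sketch below assumes that a quarter counts as over-performing when its ROA is higher than in the previous quarter, following the "relative to previous quarter" description given for the experiments. The function name, the ROA values and the treatment of ties are hypothetical.

```python
def label_quarters(roa_by_quarter):
    """Compare each quarter's ROA with that of the previous quarter."""
    labels = []
    for previous, current in zip(roa_by_quarter, roa_by_quarter[1:]):
        # Assumption: an unchanged ROA is counted as under-performance.
        labels.append("Over-perform" if current > previous else "Under-perform")
    return labels


# Hypothetical ROA values for four consecutive quarters.
roa_by_quarter = [0.010, 0.012, 0.009, 0.009]
print(label_quarters(roa_by_quarter))
# ['Over-perform', 'Under-perform', 'Under-perform']
```

The first quarter in the series has no predecessor and therefore receives no label, so n quarters yield n − 1 labels.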

Table 2: The companies' performance based on the ROA.

Year  Quarter    BMW            Volkswagen
2015  Quarter 1  Over-perform   Under-perform
2015  Quarter 2  Over-perform   Over-perform
2015  Quarter 3  Under-perform  Under-perform
2014  Quarter 1  Over-perform   Under-perform
2014  Quarter 2  Over-perform   Over-perform
2014  Quarter 3  Under-perform  Under-perform
2013  Quarter 1  Under-perform  Under-perform
2013  Quarter 2  Over-perform   Over-perform
2013  Quarter 3  Under-perform  Under-perform
2012  Quarter 1  Over-perform   Under-perform
2012  Quarter 2  Over-perform   Under-perform
2012  Quarter 3  Under-perform  Over-perform
2011  Quarter 1  Over-perform   n/a
2011  Quarter 2  Over-perform   n/a
2011  Quarter 3  Over-perform   n/a
2010  Quarter 1  Over-perform   n/a
2010  Quarter 2  Over-perform   n/a
2010  Quarter 3  Over-perform   n/a
2009  Quarter 1  Under-perform  n/a
2009  Quarter 2  Over-perform   n/a
2009  Quarter 3  Under-perform  n/a
2008  Quarter 1  Under-perform  n/a
2008  Quarter 2  Over-perform   n/a
2008  Quarter 3  Under-perform  n/a
2007  Quarter 1  Over-perform   n/a
2007  Quarter 2  Over-perform   n/a
2007  Quarter 3  Over-perform   n/a

5.3 Dictionaries

We have used two different dictionaries to determine the sentiment of tweets. The first dictionary (called the regular dictionary) is inspired by the positive and negative emotion categories of the tool Linguistic Inquiry and Word Count (LIWC) [37]. The second dictionary (called the financial dictionary) is the Loughran-McDonald master dictionary [38]. The Loughran-McDonald master dictionary is an extension of the 2of12inf word list, with the addition of words that appear in companies' annual reports. 2of12inf is a word list from SCOWL (Spell Checker Oriented Word Lists) and Friends, consisting of English words that are useful for creating high-quality word lists for spell checkers [43].

Table 3: The two different dictionaries and some example words.

Regular dictionary    Example
Positive Emotions     happy, pretty, good
Negative Emotions     hate, worthless, enemy, hurt

Financial dictionary  Example
Positive Emotions     best, achieve, able
Negative Emotions     abandoned, misprice, untrusted

Table 3 shows some sample words from the two different dictionaries we

have used.

5.4 Weka

All experiments were done using Weka [39]. Weka provides a collection of data mining algorithms, predictive modeling tools, visualization tools and a graphical user interface for easy access to its functions.

Three different classification algorithms are used in our experiments: Random Forest, Naive Bayes and AdaBoost. The Information Gain feature selection method is used for the Naive Bayes classifier. For data balancing, the SMOTE algorithm [36] and the Weka Randomize filter are used. The default settings for each algorithm in Weka are:

• Random Forest: Number of Trees = 100, Seed = 1.

• AdaBoost: Number of Iterations = 10, Seed = 1, Weight Threshold = 100.

• SMOTE: Nearest Neighbors = 5, Percentage (percentage of SMOTE instances to create) = 100, Random Seed = 1.

5.5 Experiments

We carried out four experiments to understand how well a company's performance can be predicted from public opinion extracted from social media. All experiments use the same classifier setup. For each relevant time period, a number of feature vectors are created from the datasets, and a variable describing whether the company was under-performing or over-performing (relative to the previous quarter) is added. The experiments differ in the number of feature vectors created from the dataset, in the features, and in the dictionary that is used.

The results for the different classifiers are described as confusion matri-

ces in which we present the number of true positives, false negatives, true

negatives, and false positives as illustrated in Table 4.

                  Predicted negative  Predicted positive
Actual negative   True Neg. (TN)      False Pos. (FP)
Actual positive   False Neg. (FN)     True Pos. (TP)

Table 4: Confusion matrix

To evaluate the results we use the measures accuracy, precision, recall

and F-score that can be derived from the confusion matrix.


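Accuracy is defined as:

\[ \text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]

precision is defined as:

\[ \text{precision} = \frac{TP}{TP + FP} \]

recall as:

\[ \text{recall} = \frac{TP}{TP + FN} \]

and F-score (to measure a test's accuracy) as:

\[ \text{F-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]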
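As a worked illustration of these four measures, the following sketch computes them from the entries of a confusion matrix. The function name and the counts are chosen for illustration only, and returning 0.0 for an empty denominator is an assumption of this sketch.

```python
def evaluate(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive accuracy, precision, recall and F-score from a confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Assumption: a measure with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f_score = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Illustrative counts for a classifier evaluated on 27 instances.
scores = evaluate(tp=10, fp=4, tn=10, fn=3)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```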

5.5.1 Experiments with the regular dictionary

Experiment 1: Combined tweets

In the first experiment, all tweets published during each quarter of a year are combined, and one feature vector representing that quarter is created. The words in the regular dictionary are used as features, together with a variable representing the total sentiment of the tweets and a variable indicating whether the company was over-performing or under-performing during that quarter.
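A feature vector of this kind can be sketched as below. The vocabulary consists only of the example words listed in Table 3 rather than the full dictionary, the tokenizer is a simple regular expression, and the total sentiment is assumed to be the number of positive words minus the number of negative words. These choices and all names are illustrative assumptions.

```python
import re
from collections import Counter

# Example words from Table 3; the real dictionary is much larger.
POSITIVE = ("happy", "pretty", "good")
NEGATIVE = ("hate", "worthless", "enemy", "hurt")
VOCABULARY = POSITIVE + NEGATIVE


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def quarter_feature_vector(tweets, over_performing):
    """One vector per quarter: word counts, total sentiment, class label."""
    counts = Counter(token for tweet in tweets for token in tokenize(tweet))
    word_counts = [counts[word] for word in VOCABULARY]
    # Assumption: total sentiment = positive count minus negative count.
    total_sentiment = sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)
    return word_counts + [total_sentiment, int(over_performing)]


# Hypothetical tweets for one quarter.
tweets = [
    "Happy with my new car, good service",
    "I hate the long wait",
    "Good design, pretty interior",
]
vector = quarter_feature_vector(tweets, over_performing=True)
print(vector)
```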

In the experiment, a model was trained and evaluated on 27 instances

using 10-fold cross validation.
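To make the evaluation protocol concrete, the sketch below splits 27 instances into 10 folds and yields the training and test indices for each round. It is a plain, unstratified split written for illustration; the procedure built into Weka may differ in detail, for example in how it shuffles or stratifies the data. The function name and the seed are assumptions.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for test_fold in folds:
        held_out = set(test_fold)
        train = [i for i in indices if i not in held_out]
        yield train, test_fold


splits = list(k_fold_indices(27, k=10))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

With 27 instances, seven folds contain three instances and three folds contain two, so each model is tested on very few examples.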

Table 5 shows the results for experiment 1 using three different classifiers

and the TWBMW dataset.

Table 5: The results for experiment 1 using TWBMW dataset.

Dataset  Classifier     Over-perform  Under-perform  Accuracy  Precision  Recall  F-Score
TWBMW    Random Forest  10            3              74.07%    0.714      0.769   0.74
                        4             10
TWBMW    Naive Bayes    10            3              74.07%    0.714      0.769   0.74
                        4             10
TWBMW    AdaBoost       9             4              62.96%    0.6        0.692   0.64
                        8             6

Experiment 2: Combined tweets and changes in sentiment

In the second experiment, all tweets published during each quarter of a year are combined and the total sentiment is determined. Feature vectors are created using the changes in sentiment from one quarter to the next. The words in the regular dictionary are used as features, together with a variable representing the total sentiment of the tweets and a variable indicating whether the company was over-performing or under-performing during that quarter. In the experiment, a model was trained and evaluated on 27 instances using 10-fold cross validation.
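The change-based features can be illustrated with a small sketch that subtracts each quarter's feature values from those of the following quarter. The function name and the numbers are hypothetical, and the handling of the first quarter, which has no predecessor, is left open here.

```python
def quarterly_changes(vectors):
    """Element-wise difference between each quarter's vector and the previous one."""
    return [
        [current - previous for previous, current in zip(prev_vec, cur_vec)]
        for prev_vec, cur_vec in zip(vectors, vectors[1:])
    ]


# Hypothetical word counts for three consecutive quarters.
counts = [[4, 1, 3], [6, 0, 3], [2, 5, 4]]
changes = quarterly_changes(counts)
print(changes)
```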

Table 6: The results for experiment 2 using TWBMW dataset.

Dataset  Classifier     Over-perform  Under-perform  Accuracy  Precision  Recall  F-Score
TWBMW    Random Forest  5             9              25.92%    0.313      0.357   0.33
                        11            2
TWBMW    Naive Bayes    13            1              77.77%    0.722      0.929   0.8
                        5             8
TWBMW    AdaBoost       10            4              66.66%    0.667      0.714   0.688
                        5             8

Experiment 3: One feature vector per 100 tweets

In experiment 3, one feature vector is created per 100 tweets, and the Y variable of each feature vector is assigned based on the time the tweets were published. The Y variable (the value to be predicted) is zero if the company is under-performing and one if the company is over-performing.
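The construction of one feature vector per 100 tweets can be sketched as follows. The sketch assumes that tweets are grouped by the quarter in which they were published, that groups never span two quarters, and that a final group with fewer than 100 tweets is kept. The vocabulary contains only the example words from Table 3, and all names and data are hypothetical.

```python
from collections import Counter

# Example words from Table 3; the real dictionary is much larger.
VOCABULARY = ("happy", "pretty", "good", "hate", "worthless", "enemy", "hurt")


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_instances(tweets_by_quarter, labels_by_quarter, chunk_size=100):
    """Create (feature vector, Y) pairs, one per chunk of tweets."""
    instances = []
    for quarter, tweets in tweets_by_quarter.items():
        y = labels_by_quarter[quarter]  # 1 = over-performing, 0 = under-performing
        for chunk in chunked(tweets, chunk_size):
            counts = Counter(word for tweet in chunk for word in tweet.lower().split())
            instances.append(([counts[word] for word in VOCABULARY], y))
    return instances


# Hypothetical data: 250 tweets in one quarter and 100 in the next.
tweets_by_quarter = {
    "2015-Q1": ["good car"] * 250,
    "2015-Q2": ["hate the wait"] * 100,
}
labels_by_quarter = {"2015-Q1": 1, "2015-Q2": 0}
instances = build_instances(tweets_by_quarter, labels_by_quarter)
print(len(instances), "feature vectors")
```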

In this experiment, the data is balanced using the SMOTE algorithm and the Randomize filter [47]. The Randomize filter randomly shuffles the order of the instances passed through it and is used to prevent over-fitting.
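The idea behind the two steps can be sketched in plain Python. This is a simplified illustration of SMOTE-style oversampling, not the Weka implementation: each synthetic instance is placed at a random point on the line between a minority instance and one of its nearest minority neighbours. The parameter names mirror the settings listed in Section 5.4, only percentages that are multiples of 100 are handled, and the data and class labels are hypothetical.

```python
import math
import random


def smote(minority, n_neighbors=5, percentage=100, seed=1):
    """Create synthetic minority instances by interpolating towards neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    per_instance = percentage // 100  # simplification: multiples of 100 only
    synthetic = []
    for i, sample in enumerate(minority):
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda m: math.dist(sample, m))[:n_neighbors]
        for _ in range(per_instance):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position on the segment between the two points
            synthetic.append([a + gap * (b - a) for a, b in zip(sample, neighbour)])
    return synthetic


def balance_and_shuffle(majority, minority, seed=1):
    """Oversample the minority class, then shuffle the instance order."""
    oversampled = minority + smote(minority, seed=seed)
    data = [(x, 0) for x in majority] + [(x, 1) for x in oversampled]
    random.Random(seed).shuffle(data)
    return data


# Hypothetical two-dimensional data: six majority and three minority instances.
majority = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0], [7.0, 6.0], [6.0, 7.0]]
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
dataset = balance_and_shuffle(majority, minority)
print(sum(1 for _, y in dataset if y == 1), "minority instances after oversampling")
```

With a percentage of 100, one synthetic instance is created per minority instance, so the minority class doubles in size.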

Table 7: The results for experiment 3 using TWBMW dataset.

Dataset  Classifier     Over-perform  Under-perform  Accuracy  Precision  Recall  F-Score
TWBMW    Random Forest  2564          308            84.61%    0.786      0.893   0.834
                        698           2968
TWBMW    Naive Bayes    1947          925            68.09%    0.626      0.678   0.65
                        1161          2505
TWBMW    AdaBoost       1136          1736           62.05%    0.604      0.396   0.478
                        745           2921

Table 8: The results for experiment 3 using TWVW dataset.

Dataset  Classifier     Over-perform  Under-perform  Accuracy  Precision  Recall  F-Score
TWVW     Random Forest  604           194            86.17%    0.953      0.757   0.842
                        30            792
TWVW     Naive Bayes    567           231            77.22%    0.804      0.711   0.752
                        138           684
TWVW     AdaBoost       448           350            60.86%    0.612      0.561   0.584
                        284           538


5.5.2 Experiments using the financial dictionary

Experiment 4: One feature vector per 100 tweets

In the fourth experiment, one feature vector is created per 100 tweets, and the Y variable of each feature vector is assigned based on the time the tweets were published.

In this experiment, the SMOTE and Randomize algorithms are used to balance the data instances.

Table 9: The results for experiment 4 using TWBMW dataset.

Dataset  Classifier     Over-perform  Under-perform  Accuracy  Precision  Recall  F-Score
TWBMW    Random Forest  1261          175            81.12%    0.888      0.758   0.816
                        442           1390
TWBMW    Naive Bayes    1110          326            71.54%    0.79       0.67    0.724
                        604           1228
TWBMW    AdaBoost       268           1168           60.67%    0.594      0.93    0.724
                        117           1715

Table 10: The results for experiment 4 using TWVW dataset.

Dataset  Classifier     Over-perform  Under-perform  Accuracy  Precision  Recall  F-Score
TWVW     Random Forest  1304          492            83.03%    0.914      0.726   0.808
                        122           1702
TWVW     Naive Bayes    1110          686            65.85%    0.669      0.618   0.64
                        550           1274
TWVW     AdaBoost       1629          167            57.59%    0.544      0.907   0.678
                        1368          456


6 Discussion

In the first experiment, one feature vector was created for each quarter of the year, which gives 27 data instances in total. The low number of data instances may be one reason why the accuracy is lower than in the later experiments. In the second experiment, instead of using the word counts themselves as features, the differences in word counts from the previous quarter are used. The prediction accuracy dropped for the random forest algorithm, while it improved slightly for the other classifiers. A possible reason for the low accuracy of the random forest classifier is that the sentiment in a feature vector should not be expressed in relation to other feature vectors. In the third and fourth experiments, one feature vector is created per 100 tweets and the datasets are balanced, and the prediction accuracy improves. This could be due to the balanced number of instances.

In all experiments except experiment 2, random forest was the most accurate classifier (tied with Naive Bayes in experiment 1). The highest accuracy, 86.17%, was obtained in the third experiment, where every 100 tweets from the TWVW dataset were combined into one feature vector and the words of the regular dictionary were used as features.

The best results were obtained with random forest. Random forest ranks the variables in the feature vector and also takes the relations between variables into account when splitting nodes, which can produce higher accuracy. The data used to train the random forest classifier was balanced, and therefore a more accurate classification model could be produced.


7 Conclusion

Customers' opinions about products and services are a concern for most large and medium-sized companies because they affect the company's financial performance. Social media is one of the most widely used sources of data about customers' opinions of a company. We have presented a machine learning approach to predicting the financial performance of two companies using tweets related to them from Twitter. We use two different sets of features based on two different sentiment analysis dictionaries. Three different classification algorithms (Random Forest, Naive Bayes and AdaBoost) are used to find the best model for predicting changes in Return on Assets (ROA) from one quarter to the next. Our experiments show that tweets can predict, with an accuracy of up to 86.17%, whether a company will over-perform or under-perform in the upcoming quarter of the year. However, more research on a wider range of companies is needed to establish what prediction accuracy can be achieved in general.

8 Future work

In this thesis, the sentiment of tweets and the changes in ROA from one quarter to the next have been used to predict the financial performance of a company. Changes in ROA are not the only way to measure the financial performance of a company. There are many other variables and metrics, such as Internal Rate of Return (IRR), Cash-Flow Return on Investment (CFROI), Discounted Cash Flow (DCF) and Return on Equity (ROE), that could also be used, and it would be interesting to investigate the possibilities of predicting these metrics as well.

We focused on Twitter in this work, but there are many other online forums and social media platforms that may have more effect on companies' performance or reflect the opinions of a company's users better than Twitter. A direction for future work would be to investigate other forms of social media and how well they can predict the performance of a company.

In this work we used a bag-of-words method to detect the sentiment of a text. There are many other sentiment analysis methods that can be used to find the sentiment of a text.

In this work the features that we considered consist of word counts only.

There might be many other factors that are important in predicting the

performance of a company. An obvious direction for future work is to extend

the set of features and to do more experiments on different data and on

different companies.

References

[1] Johan Bollen, Huina Mao, Xiaojun Zeng (2011). Twitter mood predicts the stock market. Journal of Computational Science 2, 1–8.

[2] Tristan Fletcher, Fabian Redpath and Joe D'Alessandro (2009). Machine Learning in FX Carry Basket Prediction. Proceedings of the International Conference of Financial Engineering, vol. 2, pp. 1371-1375.

[3] Michael T. Lash and Kang Zhao (2016). Early Predictions of Movie Success: the Who, What, and When of Profitability. Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI).

[4] Harald Schoen, Daniel Gayo-Avello, P. Takis Metaxas, Eni Mustafaraj, Markus Strohmaier (2013). The Power of Prediction with Social Media. Computer Science Faculty Scholarship, Wellesley College.


[5] Sheng Yu and Subhash Kak (2012). A Survey of Prediction Using Social Media. Department of Computer Science, Oklahoma State University.

[6] Statistics Portal. http://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/

[7] Reza Zafarani, Mohammad Ali Abbasi, Huan Liu (2014). Social Media Mining. Cambridge University.

[8] Marta Zembik (2014). Social media as a source of knowledge for customers and enterprises. Online Journal of Applied Knowledge Management, Volume 2, Issue 2.

[9] Tim Loughran and Bill McDonald (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance, Vol. LXVI, No. 1.

[10] Alexander Pak, Patrick Paroubek. Twitter as a Corpus for Sentiment Analysis and Opinion Mining. In LREC, Vol. 10, pp. 1320–1326.

[11] Mittal and Goel (2012). Stock Prediction Using Twitter Sentiment Analysis. Project report.

[12] Sahar Nassirpour, Parnian Zargham, Reza Nasiri Mahalati (2012). Electronic Devices Sales Prediction Using Social Media Sentiment Analysis. Project report, Stanford University.

[13] Opinion Finder. http://mpqa.cs.pitt.edu/opinionfinder/

[14] Harvard IV-4 dictionary. http://www.wjh.harvard.edu/~inquirer/homecat.htm


[15] Definition of '10-K'. http://www.investopedia.com/terms/1/10-k.asp

[16] TreeTagger. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

[17] Douglas M. McNair, Maurice Lorr, and Leo F. Droppleman (1971). Manual for the Profile of Mood States. San Diego, CA: Educational and Industrial Testing Service.

[18] Walaa Medhat, Ahmed Hassan, Hoda Korashy (2014). Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal.

[19] C. W. J. Granger (1969). Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica, Vol. 37, No. 3 (Aug., 1969), pp. 424-438.

[20] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, Christopher D. Manning (2011). Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 151-161.

[21] Jan A. Eklof, Peter Hackl, Anders Westlund (2009). On measuring interactions between customer satisfaction and financial results. Total Quality Management, pp. 514-522.

[22] Return on Assets. http://www.investopedia.com/terms/r/returnonassets.asp


[23] Eugene W. Anderson, Claes Fornell, Ronald T. Rust (1997). Customer Satisfaction, Productivity, and Profitability: Differences Between Goods and Services. Marketing Science, pp. 129-145.

[24] Dan Zarrella (2009). The social media marketing book. O'Reilly Media, Inc.

[25] Andreas M. Kaplan, Michael Haenlein (2009). Users of the world, unite! The challenges and opportunities of Social Media. ESCP Europe, 79 Avenue de la République, F-75011 Paris, France.

[26] Weka Data Mining. http://weka.wikispaces.com/

[27] Subhabrata Mukherjee (2012). Sentiment analysis. Indian Institute of Technology, Bombay, Department of Computer Science and Engineering.

[28] Bing Liu (2012). Sentiment analysis and opinion mining. Claypool Publishers.

[29] John Hagel III, John Seely Brown and Lang Davison (2010). The Best Way to Measure Company Performance. https://hbr.org/2010/03/the-best-way-to-measure-compan

[30] Karina Gibert, Miquel Sànchez-Marrè, Víctor Codina (2010). Principles of Accounting. International Environmental Modelling and Software Society (iEMSs).

[31] Belverd E. Needles, Marian Powers, Susan V. (2014). Principles of Accounting. South-Western Cengage Learning.

[32] Yoav Freund, Robert E. Schapire (1996). Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the Thirteenth International Conference.


[33] Russell Stuart, Norvig Peter (2003). Artificial Intelligence: A Modern Approach. Prentice Hall. ISBN 978-0137903955.

[34] Carol McDonald (2015). Parallel and Iterative Processing for Machine Learning Recommendations with Spark. https://www.mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark

[35] Rokach, Lior; Maimon, O. (2008). Data mining with decision trees: theory and applications. World Scientific Pub Co Inc. ISBN 978-9812771711.

[36] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W. Philip Kegelmeyer (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, pp. 321–357.

[37] Linguistic Inquiry and Word Count. http://liwc.wpengine.com/

[38] 2014 Master Dictionary. http://www3.nd.edu/~mcdonald/Word_Lists.html

[39] Weka 3: Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka/index.html

[40] Stehman, Stephen V. (1997). Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of Environment, pp. 77–89.

[41] Beautiful Soup Documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

[42] Sylvain Arlot (2004). A survey of cross-validation procedures for model selection. Journal of Machine Learning Research, pp. 1089-1105.

[43] Release 4 of the 12dicts word lists. http://wordlist.aspell.net/12dicts-readme-r4/


[44] BMW Quarterly Reports. https://www.bmwgroup.com/en/investor-relations/financial-reports.html

[45] Volkswagen Quarterly Reports. http://quicktake.morningstar.com/stocknet/secdocuments.aspx?symbol=vlkay

[46] SCOWL (And Friends) wordlist. http://wordlist.aspell.net/

[47] Class Randomize. http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/instance/Randomize.html

[48] Shareholders' Equity. http://www.investopedia.com/terms/s/shareholdersequity.asp

[49] Selenium with Python. http://selenium-python.readthedocs.io/
