Self-taught learning: Implementation using MATLAB


Self-taught learning is a new framework in the domain of machine learning. Its potential lies in using unsupervised learning strategies to automatically learn and extract information and consequently create a succinct representation of the given input. I present the theoretical underpinnings of the idea and then proceed to show a possible implementation of the algorithm in the MATLAB programming language. I have also included my experimental results obtained during my research. Neural networks are used extensively in the project.


THESIS

Bod Gergely

Debrecen

2013

University of Debrecen

Faculty of Informatics

Self-taught learning:

Implementation using MATLAB

Supervisor: Dr. Antal Bálint        Author: Bod Gergely

Position: Assistant professor       Degree programme: Software engineering (programtervező informatikus)

Table of contents

    1. Machine learning primer

    a) Supervised versus Unsupervised learning

    b) Neural networks

    2. Self-taught learning

    a) Introduction

    b) Representation learning

    3. Implementation

    a) Initialization

    b) Setting up the environment for learning

    c) Feature learning

    d) Classification

    e) Image database construction

    f) Enhancements of the standard algorithm

    g) Experiments

    4. Conclusion

    5. Acknowledgments


    1. Machine learning primer

In this thesis I would like to present a relatively new concept that has emerged from machine learning. Before going into specific details regarding this new framework, it is advisable to spend some time elaborating on what the field of machine learning deals with. Machine learning is universally defined as the construction of intelligent computer programs that can learn and with time improve their performance on some task (Mitchell, 1997).

Machine learning is often categorized as a branch of computer science and is usually considered a subfield of artificial intelligence. While this categorization places machine learning strictly into the domain of computing, it is fair to mention other disciplines that have been equally if not more influential in the development of this science. First of all, machine learning employs a plethora of methods and technical tools from statistics, mathematics, neuroscience, and biological systems. One particular subfield of mathematics, namely optimization, is a very important tool with broad applications in machine learning. Statistics shares many concepts, tools, and methods with machine learning, but there seems to be one distinctive feature: while statistics emphasizes inference, the primary goal of machine learning is prediction.

Typical industrial applications of ML include spam filtering, handwritten character recognition, and image classification. To give a gentle example, consider the following from Alpaydın: "For some tasks, however, we do not have an algorithm; for example, to tell spam emails from legitimate emails. We know what the input is: an email document that in the simplest case is a file of characters. We know what the output should be: a yes/no output indicating whether the message is spam or not. We do not know how to transform the input to the output. What can be considered spam changes in time and from individual to individual. What we lack in knowledge, we make up for in data. We can easily compile thousands of example messages some of which we know to be spam and what we want is to learn what constitutes spam from them. In other words, we would like the computer (machine) to extract automatically the algorithm for this task. There is no need to learn to sort numbers, we already have algorithms for that; but there are many applications for which we do not have an algorithm but do have example data... Think, for example, of a supermarket chain that has hundreds of stores all over a country selling thousands of goods to millions of customers... What the supermarket chain wants is to be able to predict who are the likely customers for a product. Again, the algorithm for this is not evident; it changes in time and by geographic location."1

The previous citation gives a good understanding of when machine learning can be employed. We do not have a rigid algorithm, but we have a massive amount of data. This data can compensate us for not having an exact algorithm, and by discovering certain statistical and mathematical patterns in the data we can construct different learning strategies.

    a) Supervised versus Unsupervised learning

There is a particular dimension along which we can differentiate learning algorithms: the absence or presence of labels provided with each incoming example. Imagine a dataset which consists of images of cars and cats. The individual images have labels attached to them describing which category (car or cat) they belong to. We could train the system by splitting the dataset into two parts: a) a training set and b) a test set. By first running the algorithm on the training set, the algorithm can make certain corrections to the parameters of the underlying model whenever it sees an example and is told what category it belongs to. This way it is possible to devise a system that can predict, with a certain statistical accuracy, which category a fresh example (one that has not been observed by the algorithm before) belongs to.

In contrast, when employing unsupervised learning we do not provide the labels (concretely, the metadata about the category) of the examples. In this case we are trying to find some hidden pattern in the underlying data. It is very well summarized by Alpaydın: "In supervised learning, the aim is to learn a mapping from the input to an output whose correct values are provided by a supervisor. In unsupervised learning, there is no such supervisor and we only have input data. The aim is to find the regularities in the input. There is a structure to the input space such that certain patterns occur more often than others, and we want to see what generally happens and what does not. In statistics, this is called density estimation."2

1 Ethem Alpaydın, Introduction to Machine Learning, 2nd ed., The MIT Press, Cambridge, Massachusetts; London, England, 2010, pages 1-2.

2 Ethem Alpaydın, Introduction to Machine Learning, 2nd ed., The MIT Press, Cambridge, Massachusetts; London, England, 2010, page 11.


    b) Neural networks

Since the algorithm devised by Andrew Ng at Stanford University relies heavily on the use of neural networks (I would even say that they are the workhorse of the algorithm), I feel the need to spend some time explaining what they are and when they are useful. The idea of a neural network was inspired by a biological construct, namely the brain. The brain consists of millions of neurons that form a very complex information processing and forwarding system.

    Neural networks consist of simple processing units that interact via weighted connections.

    They are sometimes designed in hardware but most research nowadays involves software

    simulations. They were originally inspired by ideas about how the brain makes its

    computations (Hinton, 1999).

    A typical processing unit first computes a total input which is a weighted sum of the

    incoming values from other units plus a bias term. The next step is to put its total input

through an activation function to calculate the activity of the unit. One of the most common activation functions is the sigmoid, where y = 1/(1+exp(-x)) (Hinton, 1999).
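As a minimal illustration (not part of the thesis code; the numeric values below are made up), the computation of a single sigmoid unit in MATLAB could look like this:

p = [0.5; -1.2; 0.3];      % hypothetical input values coming from other units
w = [0.8; 0.1; -0.4];      % hypothetical weights of the incoming connections
b = 0.05;                  % bias term
n = w' * p + b;            % total input: weighted sum of the inputs plus the bias
a = 1 ./ (1 + exp(-n));    % activity of the unit after the sigmoid activation function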

    One of the most interesting properties of neural networks is their ability to learn from

    examples by adapting the weights on the connections. The most widely adopted machine

    learning algorithms are supervised: they assume that there is a set of training examples, each

consisting of an input vector and a desired output vector. Learning involves sweeping through the training set a couple of times and gradually adjusting the weights so that the actual output

    produced by the network gets closer to the desired output. The simplest neural network

    architecture is built of some input units with directed, weighted connections to an output unit.

    (Hinton, 1999)


By the introduction of so-called hidden layers, neural networks can express complicated nonlinear mappings between the input and the output. Finding the optimal weights is generally computationally intractable, but gradient methods can be used effectively to find sets of weights that work well for many practical, real-life tasks.

An algorithm called backpropagation (Rumelhart et al., 1986) can be used to compute the derivatives of the error function with respect to each weight in the network. The standard error function is the squared difference between the actual and the desired outputs (Hinton, 1999).

    Source: MathWorks, Neural Network Product help

    Illustration B: Describes a single neuron with input p and weight w.

    A. Illustration

    Source: The MathWorks Inc, Neural Network Product help


    For each training case the activities of the units are calculated by a forward pass through the

    network. Then starting with the output units a backward pass is done through the network to

    compute the derivatives of the error function with respect to the total input received by each

    unit (Hinton, 1999).
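To make the weight adjustment concrete, a single gradient-descent step can be sketched as follows (the sizes, the random values and the learning rate eta are hypothetical; dEdW stands for the derivatives produced by the backward pass):

W    = randn(3, 2);        % hypothetical weight matrix
dEdW = randn(3, 2);        % derivatives of the error w.r.t. each weight (placeholder values)
eta  = 0.01;               % learning rate
W    = W - eta * dEdW;     % move every weight a small step against its gradient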

    Below one can see a graphical representation of the one-layer neural network architecture.

    C. Illustration

    Source: MathWorks, Neural Network Product help

    Illustration C depicts a one-layer network with R input elements and S neurons.

    In this network, each element of the input vector p is connected to each neuron input through

    the weight matrix W. The ith neuron has a summer function that gathers its weighted inputs

    and bias to form its own scalar output, n(i). The various n(i) taken together form an S-element

    net input vector, n. Finally, the neuron layer outputs form a column vector a. The expression

    for a is shown at the bottom of the picture. (MathWorks, Neural Network Product help)

    Note that it is not uncommon for the number of inputs to a layer to be different from the

    number of neurons (i.e., R is not necessarily equal to S). A layer is not limited to have the

    number of its inputs equal to the number of its neurons. (MathWorks, Neural Network

    Product help)
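As a minimal MATLAB sketch of the expression a = f(Wp + b) for such a layer (the sizes R = 3, S = 2 and the random values are hypothetical):

R = 3; S = 2;              % number of input elements and number of neurons
p = rand(R, 1);            % input vector
W = randn(S, R);           % weight matrix: one row of weights per neuron
b = randn(S, 1);           % bias vector
n = W * p + b;             % S-element net input vector
a = 1 ./ (1 + exp(-n));    % layer output with the sigmoid transfer function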

A network is not constrained to one hidden layer. One can stack multiple layers on top of each other, where each layer's input is the output of the previous one.

    Source: MathWorks, Neural Network Product help

    Illustration D: Matrix (IW1,1) having a source 1 (second index) and a destination 1 (first

    index). Elements of layer 1, such as its bias, net input, and output have a superscript 1 to say

    that they are associated with the first layer.

    2. Self-taught learning

In this chapter I will introduce the idea of a new framework postulated by Raina et al. in their paper (Raina et al., 2007).

    a) Introduction

    This new framework has the potential to use unlabeled data to enhance supervised

    classification tasks. They (Raina et al., 2007) do not assume that the unlabeled data follows

    the same class labels or distribution as the labeled dataset. Consequently they were able to use

    a large number of unlabeled images or other input types like audio or text that are downloaded

    randomly from the Internet. This huge unlabeled dataset is to be used to enhance the

    performance of the supervised classification task. Since such unlabeled data is much easier


and cheaper to obtain than labeled data, it is possible to apply self-taught learning to a broad range of practical problems. Their approach uses sparse coding to construct higher-

    level features using the unlabeled data. These features form a compact input representation

    and in theory can significantly improve the classification performance. (Raina et al., 2007)

    Their approach is motivated by the observation that even many randomly downloaded images

    will contain basic visual patterns (such as edges) that are similar to the images to be originally

classified. Therefore, we can learn to recognize such patterns from the unlabeled data, and these patterns can be used for the supervised learning task of interest. (Raina et al., 2007)

They distinguish their approach from the earlier method of semi-supervised learning by stating that the unlabeled data does not share the class labels or the generative distribution of the labeled data. "For example, given unlimited access to natural sounds (audio), can we perform better speaker identification?"3 (Raina et al., 2007)

    The similarity with semi-supervised learning (Nigam et al., 2000) is that both use labeled and

unlabeled data for the classification task. But unlike semi-supervised learning, the new method they propose does not assume that the unlabeled data can be assigned to the supervised learning task's class labels.3

The consequence is that acquiring unlabeled images is far easier than acquiring the same amount of labeled data for a certain category, since it is perfectly sufficient to randomly download, for example, 100,000 images from the Internet.

Their argument also enjoys a biological motivation: it has long been assumed by neuroscientists that most human learning is performed in an unsupervised fashion.

Their approach is split into the following two stages: first, we learn a representation using

    only unlabeled data. Then, we apply this representation to the labeled data, and use it for the

    classification task. Once the representation has been learned in the first stage it can then be

    applied repeatedly to different classification tasks.3

    b) Representation learning

Many of our machine learning tasks are hindered by the fact that it is often very difficult to recognize the underlying factors that contribute the most explanatory power. Nowadays a common method is to manually preprocess the data by discovering the most fundamental factors and to transform the data so that a traditional classifier can solve the problem at hand effectively, but this often requires industry-specific knowledge. Feature engineering like this is important, but it also has the weakness of being labor intensive and difficult to apply in a general context (Bengio, 2012). Recent advances in machine learning concentrate on representation learning, which can broaden the applicability of our already known algorithms to classify/predict on the data. Self-taught learning holds promise by incorporating several methods from representation learning, like sparse auto-encoders and deep networks, automating the process of feature extraction and augmenting the final classification/prediction task.

3 Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, Andrew Y. Ng, Self-taught Learning: Transfer Learning from Unlabeled Data, 2007, pages 1-2.

    3. Implementation

At the beginning of my research I decided to start by implementing the proposed algorithms presented on the website of the Stanford class UFLDL (Unsupervised Feature Learning and Deep Learning)4, where some framework-related, mostly initialization code was available, and I have included it in my implementation. I have coded most of the algorithms both in MATLAB and in NumPy/SciPy (NumPy and SciPy are extensions to the Python language for numerical computations). My observation was that MATLAB provided me with a much better environment for rapid development and prototyping. The convenience of the debugger in MATLAB made a big impact on the speed of my algorithm development. Considering the above, I decided to provide the code snippets in MATLAB instead of Python (some of the utility scripts were nevertheless scripted in Python).

    a) Initialization

The MATLAB environment I have created consists of a top-level script file called run.m, whose job is to start the profiling of the whole algorithm and to kick off the several stages of the process. Let's take a look at the code:

    4 http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial


run.m

profile on;   % start the MATLAB profiler
start = tic;
linearDecoder;
cnn;
elapsed = toc(start);
fprintf('Elapsed time: %f seconds\n', elapsed);
fprintf('Done.\n');

By calling the tic function we record the current time in the start variable. After the algorithm has finished, the toc function is called with the start variable, and the result is stored in the elapsed variable, which we use to print the elapsed number of seconds.

    b) Setting up the environment for learning

At this point we should step into the linearDecoder.m script file, which performs several tasks. Specifically, it needs to:

- Initialize the architecture and the task-specific parameters
- Optionally create the image batches
- Apply ZCA whitening
- Define the number of image patches to work with
- Randomly initialize the weights of the neural network
- Start the optimization of the neural network's cost function
- Save the optimal weights, which become the features we can use for representation
- Visualize the learned features

Now let's look at the parameter initialization section:

imageChannels = 3;     % number of channels (RGB, so 3)
patchDim = 8;          % patch dimension
numPatches = 100000;   % number of patches
visibleSize = patchDim * patchDim * imageChannels;   % number of input units
outputSize = visibleSize;   % number of output units
hiddenSize = 400;           % number of hidden units
sparsityParam = 0.035;      % desired average activation of the hidden units
lambda = 3e-3;              % weight decay parameter
beta = 5;                   % weight of sparsity penalty term
epsilon = 0.1;              % epsilon for ZCA whitening

I set the number of image channels to three since we will be processing colored (RGB) pictures, where an image is stored in a three-dimensional matrix in which one dimension holds the values of the red-green-blue components. The variable numPatches holds the number of patches we would like to extract from our unlabeled image dataset. I set the size of one patch to an 8 x 8 matrix by specifying patchDim to be eight. We need to take into account how much RAM our system has. By setting the number of patches to 100,000 we will be allocating a total of (8 x 8 x 3) x 100,000 double-precision values (8 bytes each), which is a total of ~153 MB.
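This estimate can be reproduced with a quick back-of-the-envelope calculation, assuming double precision (8 bytes per value):

bytesPerValue = 8;                                    % double precision
bytesNeeded = (8 * 8 * 3) * 100000 * bytesPerValue;   % 153,600,000 bytes
fprintf('Patch matrix size: %.1f MB\n', bytesNeeded / 1e6);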

In the next section we calculate the number of input and output units. In our case this is 192, since we have 8 x 8 x 3 = 192 incoming values for one image patch example. We set the number of hidden units to 400.

By setting the sparsity parameter to a number close to zero, we drive the neural network to keep most of its hidden units inactive, since we want only a small number of nodes to be active, essentially in firing mode. The lambda and beta parameters will have an impact on the behaviour of the neural network's cost function.

Let me proceed further in linearDecoder.m:

batchSize = 25000;   % should be less than the number of patches
patches = sampleIMAGES_fromBatches('C:\folder', numPatches);
save 'patches_imageNet.mat' patches

The batchSize variable tells the algorithm later in how many batches the optimization of the cost function should happen. This is required for limited-memory processing, since this way one does not need to load all the data into RAM. In our case we have 100,000 sample patches, and this enables us to divide the processing into four separate iterations, consequently putting a lot less strain on the main memory.

The first argument of the sampleIMAGES_fromBatches function is a string denoting the folder containing our saved batches of images. Let me first show the MATLAB code used to create these image batches.

saveImagesToFiles.m

function IMAGES = saveImagesToFiles(pathFromDir, saveDir, row, col)

numOfImagesInOneBatch = 500;
fileFolder = fullfile(pathFromDir);
dirOutput = dir(fullfile(fileFolder, '*.jpg'));
fileNames = {dirOutput.name}';
numOfImages = numel(fileNames);
I = imread(fileNames{1});

numOfBatches = floor(numOfImages / numOfImagesInOneBatch);
numOfRemainder = mod(numOfImages, numOfImagesInOneBatch);

fprintf('Number of images to save: %d\n', numOfImages);
fprintf('Number of images in one batch: %d\n', numOfImagesInOneBatch);
fprintf('Number of batches: %d\n', numOfBatches + 1);

numberOfImagesSaved = 0;

for n = 1:numOfBatches
    batchName = strcat('imageBatch_', num2str(n));
    % Preallocate the batch
    IMAGES = zeros([row col 3 numOfImagesInOneBatch], class(I));
    for i = 1:numOfImagesInOneBatch
        currentImage = (n-1)*numOfImagesInOneBatch + i;
        I = imread(fileNames{currentImage});
        if not(numel(size(I)) == 3)
            fprintf(fileNames{currentImage})
            continue
        end
        IMAGES(:,:,:,i) = I;
        numberOfImagesSaved = numberOfImagesSaved + 1;
        fprintf('Loaded image %d in batch %d. Global image number: %d\n', i, n, numberOfImagesSaved);
    end
    save(strcat(saveDir, '/', batchName, '.mat'), 'IMAGES');
    fprintf('Saved batch number %d out of %d\n', n, numOfBatches + 1);
    clear IMAGES;
end

if not(numOfRemainder == 0)
    batchName = strcat('imageBatch_', 'remainder_', num2str(numOfRemainder));
    % Preallocate the batch
    IMAGES = zeros([row col 3 numOfRemainder], class(I));
    for i = 1:numOfRemainder
        currentImage = numOfBatches*numOfImagesInOneBatch + i;
        I = imread(fileNames{currentImage});
        if not(numel(size(I)) == 3)
            fprintf(fileNames{currentImage})
            continue
        end
        IMAGES(:,:,:,i) = I;
        numberOfImagesSaved = numberOfImagesSaved + 1;
        fprintf('Loaded image %d in the remainder batch. Global image number: %d\n', i, numberOfImagesSaved);
    end
    save(strcat(saveDir, '/', batchName, '.mat'), 'IMAGES');
    fprintf('Saved the remainder batch with %d images in it.\n', numOfRemainder);
    clear IMAGES;
end

We pass the directory containing our original images as the first parameter. The second parameter specifies the folder to save the transformed and packaged images to. The row and col parameters denote the dimensions to which we would like to resize our images. We can specify in the function how many images we want to have in one batch file. There is always a trade-off when working with batches of data. The more batches we have, the less memory we need, but on the other hand many more I/O operations are required to move the data from disk to RAM, and the code readability suffers too. In case we have plenty of memory we should set this number higher, so that fewer I/O operations need to be performed, at the cost of higher pressure on the main memory.

    In the next steps we calculate the number of batches required to hold all the images with the

    given extension from the given folder. In the for loop we start putting together the batches by

    reading each image into memory by calling the imread built-in MATLAB function. The

dimensions of the batch matrices are the following: the number of rows, columns, color channels (three), and the number of images to be stored in one batch. As exception handling, I guard against invalid images that do not have a three-dimensional structure by simply skipping them and logging the name of the particular image.

After this short diversion let's get back to our function, sampleIMAGES_fromBatches. Here the first parameter denotes the folder where we store the images that we have packaged into batch files. We also tell it how many patches we want to create, which in our case is 100,000.

patchsize = 8;   % Use 8x8 patches
numpatches = numberOfPatchesToCreate;
numOfChannels = 3;

% Initialize patches with zeros.
patches = zeros(patchsize*patchsize*numOfChannels, numpatches);

fileFolder = fullfile(directoryOfBatches);
dirOutput = dir(fullfile(fileFolder, '*.mat'));
fileNames = {dirOutput.name}';
numOfBatches = numel(fileNames);

samplesCreated = 0;
for n = 1:numOfBatches
    load(fileNames{n});
    fprintf('Sampling from batch: %s\n', fileNames{n});
    numOfImagesInCurrentBatch = size(IMAGES, 4);
    batchSize = numberOfPatchesToCreate / numOfBatches;
    for i = 1:batchSize
        randImg = randi(numOfImagesInCurrentBatch);
        patchRowStart = randi(505);
        patchRowRange = patchRowStart:patchRowStart+patchsize-1;
        patchColumnStart = randi(505);
        patchColumnRange = patchColumnStart:patchColumnStart+patchsize-1;
        patchSample = zeros(patchsize*patchsize*numOfChannels, 1);
        for j = 1:numOfChannels
            patchSample((j-1)*patchsize*patchsize+1:(j*patchsize*patchsize), 1) = ...
                reshape(IMAGES(patchRowRange, patchColumnRange, j, randImg), patchsize*patchsize, 1);
        end
        patches(:, (n-1)*batchSize + i) = patchSample;
        samplesCreated = samplesCreated + 1;
    end
end
fprintf('Number of samples created: %d\n', samplesCreated);

In the above code excerpt we randomly select 8 x 8 x 3 dimensional patches that we are going to feed into our neural network. The return value will be a matrix containing the randomly sampled 100,000 patches, each consisting of 192 values.

Now going back to linearDecoder.m:

patches = sampleIMAGES_fromBatches('C:\folder', numPatches);
save 'patches_imageNet.mat' patches


    We now have the sampled patches (192 x 100,000) in the patches variable. For convenience it

    is recommended to save this matrix into a MATLAB file making it persistent on disk in case

    we want to rerun the algorithm.

    Proceed forward in linearDecoder.m:

    % Subtract mean patch

    meanPatch = mean(patches, 2);

    patches = bsxfun(@minus, patches, meanPatch);

    % Apply ZCA whitening

    sigma = patches * patches' / numPatches;

    [u, s, v] = svd(sigma);

    ZCAWhite = u * diag(1 ./ sqrt(diag(s) + epsilon)) * u';

    patches = ZCAWhite * patches;

In this step we normalize the patches by subtracting the mean patch from each patch and then applying ZCA whitening.
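As an optional sanity check (not part of the original script), the covariance of the whitened patches should be close to the identity matrix; assuming the patches and numPatches variables from above:

sigmaWhite = patches * patches' / numPatches;    % covariance after whitening
offDiag = sigmaWhite - diag(diag(sigmaWhite));   % keep only the off-diagonal entries
fprintf('Mean absolute off-diagonal covariance: %g\n', mean(abs(offDiag(:))));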

    c) Feature learning

Now we get to the exciting part. This section does the heavy lifting by optimizing the weights of the neural network to yield the feature activation vector. Let's go back to our code:

theta = initializeParameters(hiddenSize, visibleSize);

The initializeParameters function takes two arguments, namely the number of hidden units and the number of input/output units; these are among the most important characteristics of the architecture of any neural network.

initializeParameters.m

function theta = initializeParameters(hiddenSize, visibleSize)

%% Initialize parameters randomly based on layer sizes.
r = sqrt(6) / sqrt(hiddenSize + visibleSize + 1);   % we'll choose weights uniformly from the interval [-r, r]
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;
b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);

% Convert weights and bias gradients to the vector form.
% This step will "unroll" (flatten and concatenate together) all
% parameters into a vector.
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];

From the above code snippet we get a randomly initialized vector of size 154,192, where the hidden size is 400 and the visible size is 192 (the dimension of a single image patch).
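The size of theta can be verified directly from the layer sizes; it is simply the number of entries in W1, W2, b1 and b2:

hiddenSize = 400; visibleSize = 192;
numParams = hiddenSize*visibleSize + visibleSize*hiddenSize + hiddenSize + visibleSize
% numParams = 154192, matching the length of the unrolled theta vector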

    Now we are ready to look at the code where we are going to conduct the feature

    optimization step.

% Use minFunc to minimize the function
addpath minFunc/
options = struct;
options.Method = 'lbfgs';
options.maxIter = 200;
options.display = 'on';

[optTheta, cost] = minFunc( @(p) sparseAutoencoderLinearCostBatch(p, ...
                                 visibleSize, hiddenSize, ...
                                 lambda, sparsityParam, ...
                                 beta, patches, batchSize), ...
                            theta, options);

% Save the learned features and the preprocessing matrices
fprintf('Saving learned features and preprocessing matrices...\n');
save('imagenet_Features.mat', 'optTheta', 'ZCAWhite', 'meanPatch');
fprintf('Saved\n');


We register with MATLAB the path to our optimizer, which is provided on the Stanford class site5. This optimizer is capable of conducting a flavour of limited-memory BFGS optimization. When I experimented with one of MATLAB's built-in optimizers (fminunc) I had memory problems, since it was impossible to obtain the Hessian matrix of such a huge vector (154,192 free variables). A limited-memory BFGS optimizer approximates the Hessian matrix without building the whole matrix in memory. As a side note I would like to point out that SciPy provides the minimize function in its optimization library, which can do L-BFGS out of the box.

The minFunc optimizer takes a function handle to the cost function to be minimized, along with the variables (in this case the features, theta) and the options that govern the behaviour of the optimizer. The function will yield the optimized variables and the cost associated with those variable values.

After we have obtained those values, it is again recommended to save the optimal features to disk so they are available the next time for further experimentation.

Now we have reached one of the most important algorithms in this thesis, namely the cost function of the neural network that acts as an auto-encoder, whose task is to produce a good approximation of the input values by devising the optimal feature activations. It is time for code excerpts once again.

function [cost, grad] = sparseAutoencoderLinearCostBatch(theta, visibleSize, hiddenSize, lambda, sparsityParam, beta, data, batchSizeParam)

The above is the signature of the cost function, which upon completion returns the cost and the gradients that will be used by the optimizer.

Let me enumerate the parameters of this function:

- theta: the weight values to be used by the neural network (this is what is optimized)
- visibleSize: number of nodes in the input/output layer
- hiddenSize: number of nodes in the hidden layer
- lambda: weight decay parameter
- sparsityParam: provides an incentive for the average hidden unit activations to converge to this value
- beta: weight of the sparsity penalty
- data: the matrix containing the training data (in our case a 192 x 100,000 matrix)
- batchSizeParam: the number of examples to process in one batch

5 http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

    As a first step we distribute the variables in the theta vector among the weight parameters of

    the neural network.

W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

Let's proceed further:

    numOfExamples = size(data,2);

    batchSize = batchSizeParam;

    numOfBatches = numOfExamples/batchSize;

    hwb = zeros(visibleSize,batchSize);

    a2 = zeros(hiddenSize, batchSize);

    pj = zeros(hiddenSize,batchSize);

    summa = 0;

    batchStart = 1;

    batchEnd = batchSize;

    for t=1:numOfBatches

    %forward propagation optimized with vectorization

    z2 = W1 * data(:,batchStart:batchEnd) + repmat(b1,1,batchSize);

    a2 = sigmoid(z2);

    z3 = W2 * a2 + repmat(b2,1,batchSize);

    hwb = z3; % makes it a LINEAR DECODER!

    ... to be continued.

First we assign the number of examples we are providing to the variable numOfExamples. We assign the number of examples in one batch to the variable batchSize, which we chose to be 25,000. I found this to be a good number on the medium-priced PC where I conducted my experiments, since having a total of 100,000 examples then requires four iterations. We are faced with a trade-off again, as with any batch processing. MATLAB has very powerful linear algebra libraries that can exploit vectorized calculations very efficiently, which creates a strong incentive to pass a matrix with all the examples in it. But that might be impossible, or might put enormous pressure on the memory system. On the other hand, by using smaller batches of examples instead of one big batch, we can make the optimizer converge faster to the optimal value and also put less pressure on RAM. I found a batch size of 25,000 examples to be a good compromise.

The next step in our calculation is to start forward propagation. Using the weights between the input and the hidden layer we can determine the hidden unit activations a2 for each example in the given batch. Please note that here we are using vectorized notation, which is a first-line optimization method, practically necessary in MATLAB. Unfortunately, it makes the code harder to read, but the speed gains are worth it in almost every scenario. During rapid prototyping I used for loops, which are executed by the MATLAB interpreter and resulted in a serious degradation in performance.

We are using the sigmoid function to decide whether a node is active or not. The sigmoid function has the property that it maps 0 to 0.5, converges to +1 very rapidly as the input gets relatively large, and converges to zero as the input takes a relatively large negative value. The MATLAB code to compute the sigmoid function:

    function sigm = sigmoid(x)

    sigm = 1 ./ (1 + exp(-x));

    end

    Source: Mathworks, Product help

    z3 = W2 * a2 + repmat(b2,1,batchSize);

    hwb = z3; % makes it a LINEAR DECODER!

The next step in the calculation is to compute the activations of the output units, where the chosen method is to use a linear activation function instead of a sigmoid, since a linear function does not constrain the output values to the [0,1] interval as the sigmoid does. Specifically, our linear function is the identity function, which we apply to z3 and assign to the output unit variable, hwb, right away.

Let's proceed. Now we compute the squared error term of the cost function, where we penalize any large deviation of the output activation hwb from the original input variable, data. The incentive of the neural network thus becomes that the output should be equal to the input.

    for l=1:batchSize

    diff = hwb(:,l) - data(:,(t-1)*batchSize+l);

    summa = summa + (diff'*diff) * 0.5;

    end
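As an aside, the same per-batch error term could be accumulated without the inner loop; a vectorized equivalent (my suggestion, relying on the same hwb, data, summa, batchStart and batchEnd variables as the loop above) would be:

diffBatch = hwb - data(:, batchStart:batchEnd);    % deviation of the output from the input
summa = summa + 0.5 * sum(sum(diffBatch .^ 2));    % accumulate the squared error of the batch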

Back in the cost function, the next line adds the current activations to pj, which we use to accumulate the sum of the activations over all examples so that we can later calculate an average activation per node.

    pj = pj + a2;

The next two lines are simply housekeeping, where we advance the variables that mark the range of examples to be processed. This is also the end of the for loop that iterates through each batch in order to complete the feed-forward step.

    batchStart = batchStart + batchSize;

    batchEnd = batchEnd + batchSize;

    end % end of 'for t=1:numOfBatches'

As can be seen in the next line, we calculate the average activation of each hidden unit.

    pj = sum(pj,2) * numOfExamples^-1;

    Then we have our cost term by averaging over all of the examples:

    cost = summa / numOfExamples;

    In the next code section we can see how the weight decay term is calculated:

    W2Sum = sum(sum(W2 .* W2));

W1Sum = sum(sum(W1 .* W1));
cost = cost + (lambda/2) * (W1Sum + W2Sum);

The vectorized notation obscures the code again. The weight decay term acts as a regularization agent whose purpose is to avoid over-fitting. Let's say we have a cost function measuring the error term E(w), where w denotes the weight matrix. Then we use Emod(w) = E(w) + (lambda/2) * ||w||^2, where ||w||^2 is the sum of the squared weights. This has the effect of penalizing large weights.

    We should proceed now to the next step:

    p = sparsityParam;

    sumKL = sum(p*(log(p./pj)) + (1-p)*log((1-p) ./ (ones(hiddenSize,1)-pj)));

    cost = cost + beta * sumKL;

In the above code snippet we make use of our desired sparsity parameter for the activations. We effectively penalize any deviation from our chosen sparsity value and add that penalty to the cost function. The penalty is based on the Kullback-Leibler divergence, which measures the difference between two probability distributions: it tries to measure the information loss that occurs when we approximate or substitute one distribution with a different one. Here the two distributions are Bernoulli distributions with means p and pj, where p represents our desired sparsity and pj is the actual average activation.
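For reference, the per-unit penalty used above can be written as a small MATLAB function handle; this is an illustration only and mirrors the expression inside sumKL:

% KL divergence between Bernoulli distributions with means p and pj
klBernoulli = @(p, pj) p .* log(p ./ pj) + (1 - p) .* log((1 - p) ./ (1 - pj));
klBernoulli(0.035, 0.035)   % zero penalty when the average activation equals the target
klBernoulli(0.035, 0.2)     % the penalty grows as pj drifts away from the desired sparsity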

Finally we calculate the sparsity term that will be needed in the backpropagation step:

    sparsityTerm = beta*( (-1*p)./pj + (1-p)./(ones(hiddenSize,1)-pj) );

Now we have reached the point where we are ready to apply backpropagation:

% HERE CALCULATE IN A LOOP THE FEEDFORWARD ACTIVATIONS AGAIN!
% (W1grad, W2grad, b1grad and b2grad are gradient accumulators, zero-initialized before this loop)
batchStart = 1;
batchEnd = batchSize;
for t = 1:numOfBatches
    z2 = W1 * data(:,batchStart:batchEnd) + repmat(b1,1,batchSize);
    a2 = sigmoid(z2);
    z3 = W2 * a2 + repmat(b2,1,batchSize);
    hwb = z3;   % makes it a LINEAR DECODER!
    % with a linear decoder the derivative of the output activation is 1,
    % so no derivOutput factor is needed
    errorOutput = -1 * (data(:,batchStart:batchEnd) - hwb);
    derivHidden = a2 .* (ones(hiddenSize, batchSize) - a2);
    % add the sparsity term!
    errorHidden = (W2'*errorOutput + repmat(sparsityTerm, 1, batchSize)) .* derivHidden;
    W2grad = W2grad + errorOutput * a2';
    b2grad = b2grad + sum(errorOutput, 2);
    W1grad = W1grad + errorHidden * data(:,batchStart:batchEnd)';
    b1grad = b1grad + sum(errorHidden, 2);
    batchStart = batchStart + batchSize;
    batchEnd = batchEnd + batchSize;
end

We are required to calculate the feed-forward activations once again due to our batch processing solution. This incurs a performance penalty on our algorithm but makes it possible to operate on machines with limited memory resources. By applying backpropagation we have a very efficient way to calculate the derivatives of our cost function. We are effectively calculating the error terms between our activations and the desired target value, namely our original input, data.

    What we are left with is to perform an averaging of the gradients.

    W2grad = W2grad * (numOfExamples^-1);

    b2grad = b2grad * (numOfExamples^-1);

    W1grad = W1grad * (numOfExamples^-1);

    b1grad = b1grad * (numOfExamples^-1);

    % add the weight decay term to W2grad and W1grad

    W2grad = W2grad + lambda*W2;

    W1grad = W1grad + lambda*W1;

    Lastly we need to add the weight decay term to the gradients as shown above.

    Finally we unroll the weights and the bias values into a vector:

    grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
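Before handing the cost function to the optimizer it is common practice to compare the analytic gradient against a numerical finite-difference estimate on a tiny instance. The following is only a sketch of such a check (it is not part of the original scripts); it assumes the functions above are on the MATLAB path and that lambda, sparsityParam and beta are set as in linearDecoder.m, while the tiny sizes are chosen purely to keep the check fast:

visibleCheck = 8; hiddenCheck = 5; numExamplesCheck = 10;   % tiny hypothetical sizes
dataCheck  = randn(visibleCheck, numExamplesCheck);
thetaCheck = initializeParameters(hiddenCheck, visibleCheck);
costFun = @(t) sparseAutoencoderLinearCostBatch(t, visibleCheck, hiddenCheck, ...
    lambda, sparsityParam, beta, dataCheck, numExamplesCheck);
[~, analyticGrad] = costFun(thetaCheck);
numGrad = zeros(size(thetaCheck));
epsStep = 1e-4;
for k = 1:numel(thetaCheck)
    e = zeros(size(thetaCheck));
    e(k) = epsStep;
    % central difference approximation of the k-th partial derivative
    numGrad(k) = (costFun(thetaCheck + e) - costFun(thetaCheck - e)) / (2 * epsStep);
end
fprintf('Largest difference between the gradients: %g\n', max(abs(numGrad - analyticGrad)));

If the two gradients agree to several decimal places, the backpropagation code above can be trusted.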


By passing the previously described cost function to the L-BFGS optimizer, after several iterations we obtain the optimal weights and the associated cost value:

[optTheta, cost] = minFunc( @(p) sparseAutoencoderLinearCostBatch(p, ...
                                 visibleSize, hiddenSize, ...
                                 lambda, sparsityParam, ...
                                 beta, patches, batchSize), ...
                            theta, options);

    As a last step it is worthwhile and interesting to visually represent the learned features by:

    W = reshape(optTheta(1:visibleSize * hiddenSize), hiddenSize, visibleSize);

    b = optTheta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);

    displayColorNetwork( (W*ZCAWhite)');

Running the algorithm resulted in the picture shown below, in which one can easily spot the learned edges; it is simply an image representation of the optimal weights obtained by running the optimizer.


The previous picture is useful for debugging purposes as well. If most of the squares contain only blurred patterns and no clear edges are present, we should suspect that our algorithm is not working as it is supposed to.

    As a final step if one used a profiler during execution it is advisable to stop the profiler and

    save the results to identify possible bottlenecks in the code execution:

    profile viewer

    p = profile('info');

    profsave(p,'profile_results')

    d) Classification

    If we look back to our run.m top level script we can see that the next step in our processing is

    to step into the cnn.m file.

    As a first step we initialize the environment once again with our parameters regarding the

    architecture of the neural network, the dimension of the image patches, etc.

    Please remember that we have saved the optimal weights of the network along with the ZCA

    whitening information used at the end of our sparse auto-encoder algorithm:

    save('imagenet_Features.mat', 'optTheta', 'ZCAWhite', 'meanPatch');

At this point the next step is to load the data back into memory by issuing the following command:

    load imagenet_Features;

    This command will load into RAM the optimal features, the matrix used for ZCA whitening

    and the mean values for each input variable (in our case 192).

    Once again we reshape our incoming features into matrices representing the weights and the

    bias term by:

    W = reshape(optTheta(1:visibleSize * hiddenSize), hiddenSize, visibleSize);

    b = optTheta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);


The next step is to load our labeled dataset containing the training and the test data. These images will be used during our supervised classification task.

    load trainImages;

    load testImages;

Both trainImages and testImages are MATLAB format files containing raw colored pictures in matrix form. Obtaining labeled images for each category is a very expensive process; it is often very difficult to obtain massive amounts of labeled data per category. One of the ways I collected these labeled images was by going to www.flickr.com. One can search for images by typing the required category into the search field, and a reasonable number of pictures will be presented by the web site. Obviously the categorization by Flickr is far from perfect, and manual picking of images is necessary to make sure that we are indeed acquiring images of the correct category. I have devised four categories:

- Cars
- Cats
- Tigers
- Trees

By manually downloading a small number of samples from each category, one can easily divide them into training and test sets. As a preprocessing step it is necessary to transform the images into a uniform and rather small dimension (64 x 64) for computational reasons, which can be done with the following function:

function IMAGES = loadandResizeImages(pathToDir, row, col)

fileFolder = fullfile(pathToDir);
dirOutput = dir(fullfile(fileFolder, '*.jpg'));
fileNames = {dirOutput.name}';
numFrames = numel(fileNames);
I = imread(fileNames{1});

% Preallocate the array
IMAGES = zeros([row col 3 numFrames], class(I));

for i = 1:numFrames
    I = imread(fileNames{i});
    I = imresize(I, [row col]);
    imwrite(I, strcat(pathToDir, 'resized/', fileNames{i}(:,1:end-4), '.bmp'));
    fprintf('Resized and saved picture number: %d\n', i);
end

Once we have the resized images on disk, we should package them into MATLAB matrix form.

At this point we are ready to apply convolution and pooling on both the training and the test data. Convolution and pooling, among other things, exploit certain statistical similarities of neighboring parts of the image. This results in significant computational savings, since we can represent the incoming data more succinctly. In our case we convolve the learned 8 x 8 features with the large incoming labeled images. Pooling is very useful for statistical aggregation: it takes a large contiguous, non-overlapping part of the image and transforms it into a single value. It makes our algorithm much more computationally feasible.

For memory and performance reasons we conduct the convolution and pooling steps in several iterations over the features.
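To get a feel for the sizes involved, the following short calculation uses the 64 x 64 images and 8 x 8 patches described above together with a hypothetical pooling dimension of 19:

imageDim     = 64;                              % resized labeled images
patchDim     = 8;                               % size of the learned features
convolvedDim = imageDim - patchDim + 1;         % 57 x 57 responses per feature and image
poolDim      = 19;                              % hypothetical pooling region size
pooledDim    = floor(convolvedDim / poolDim);   % 3 x 3 pooled values per feature and image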

Let's look at the signature of the convolution function:

function convolvedFeatures = cnnConvolve(patchDim, numFeatures, images, W, b, ZCAWhite, meanPatch)

In the next few instructions we allocate space for our variables and for the final convolved features.

numImages = size(images, 4);
imageDim = size(images, 1);
imageChannels = size(images, 3);
convolvedFeatures = zeros(numFeatures, numImages, imageDim - patchDim + 1, imageDim - patchDim + 1);

We need to account for the ZCA whitening preprocessing once again; instead of whitening every image patch explicitly, the whitening matrix and the mean patch are folded into the weights and the bias:

    WT = W*ZCAWhite;

    bT = b - WT * meanPatch;

    WT = reshape(WT, numFeatures, patchDim*patchDim, imageChannels);


    Then we conduct convolution on each image and each feature and each channel. A convolved

    image is created by executing the following instructions:

    feature = flipud(fliplr(squeeze(feature)));

    im = squeeze(images(:, :, channel, imageNum));

    convolvedImage = convolvedImage + conv2(im, feature, 'valid');

    Then we apply the sigmoid function in order to produce the activations:

    convolvedImage = sigmoid(convolvedImage);

    The pooling is performed by the following function:

cnnPool.m

function pooledFeatures = cnnPool(poolDim, convolvedFeatures)

numImages = size(convolvedFeatures, 2);
numFeatures = size(convolvedFeatures, 1);
convolvedDim = size(convolvedFeatures, 3);

pooledFeatures = zeros(numFeatures, numImages, floor(convolvedDim / poolDim), floor(convolvedDim / poolDim));
pool_length = floor(convolvedDim / poolDim);

rowbegin = 0;
rowend = 0;
columnbegin = 0;
columnend = 0;

for i = 1:numFeatures
    for j = 1:numImages
        for r = 1:pool_length
            for c = 1:pool_length
                rowbegin = 1 + poolDim * (r-1);
                rowend = poolDim * r;
                columnbegin = 1 + poolDim * (c-1);
                columnend = poolDim * c;
                pooledFeatures(i, j, r, c) = ...
                    mean(mean(convolvedFeatures(i, j, rowbegin:rowend, columnbegin:columnend)));
            end
        end
    end
end


We pool the features over poolDim x poolDim regions, producing a matrix with dimensions numFeatures x numImages x floor(convolvedDim/poolDim) x floor(convolvedDim/poolDim), where a single value within this matrix corresponds to a single pooling region.
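A toy call of cnnPool (with the 400 learned features, a single image, the 57 x 57 convolved responses implied by the sizes above, and the hypothetical pooling dimension of 19) would look like:

convolved = rand(400, 1, 57, 57);   % numFeatures x numImages x convolvedDim x convolvedDim
pooled = cnnPool(19, convolved);    % average pooling over 19 x 19 regions
size(pooled)                        % returns 400 x 1 x 3 x 3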

After we are done with convolving and pooling all the features of each image, it is again recommended to save the result by:

    save('pooledFeatures.mat', 'pooledFeaturesTrain', 'pooledFeaturesTest');

    Now we are presented with a computationally feasible supervised classification task. We are

    going to use the pooledFeaturesTrain and pooledFeaturesTest to train a softmax classifier.

softmaxLambda = 1e-4;
numClasses = 4; %2

% Reshape the pooledFeatures to form an input vector for softmax
softmaxX = permute(pooledFeaturesTrain, [1 3 4 2]);
softmaxX = reshape(softmaxX, numel(pooledFeaturesTrain) / numTrainImages, numTrainImages);
softmaxY = trainLabels;

options = struct;
options.maxIter = 200;
softmaxModel = softmaxTrain(numel(pooledFeaturesTrain) / numTrainImages, ...
    numClasses, softmaxLambda, softmaxX, softmaxY, options);

Let's open the softmaxTrain function to see how it operates exactly.

softmaxTrain.m

function [softmaxModel] = softmaxTrain(inputSize, numClasses, lambda, inputData, labels, options)

if ~exist('options', 'var')
    options = struct;
end
if ~isfield(options, 'maxIter')
    options.maxIter = 400;
end

% initialize parameters
theta = 0.005 * randn(numClasses * inputSize, 1);

addpath minFunc/
options.Method = 'lbfgs';
minFuncOptions.display = 'on';

[softmaxOptTheta, cost] = minFunc( @(p) softmaxCost(p, ...
                                        numClasses, inputSize, lambda, ...
                                        inputData, labels), ...
                                   theta, options);

softmaxModel.optTheta = reshape(softmaxOptTheta, numClasses, inputSize);
softmaxModel.inputSize = inputSize;
softmaxModel.numClasses = numClasses;

In case we did not provide the maximum iteration number in the options structure, we perform 400 iterations. Then we randomly initialize the variable to be optimized, theta. We specify the optimizer to perform an L-BFGS search and we pass the function softmaxCost to the optimizer. After we are finished with the optimization we save the results into a structure. But let's look into the cost function to gain some understanding of the softmax classifier:

softmaxCost.m

function [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels)

theta = reshape(theta, numClasses, inputSize);
numCases = size(data, 2);
groundTruth = full(sparse(labels, 1:numCases, 1));
cost = 0;
thetagrad = zeros(numClasses, inputSize);

M = theta * data;
M = bsxfun(@minus, M, max(M, [], 1));
expM = exp(M);
normTerm = 1 ./ sum(expM, 1);
h = (repmat(normTerm', 1, numClasses) .* expM')';

probSum = 0;
for i = 1:numCases
    for j = 1:numClasses
        probSum = probSum + groundTruth(j,i) * log(h(j,i));
    end
end
cost = -1 * (numCases^-1) * probSum;

weightTerm = (lambda/2) * sum(sum(theta .* theta));
cost = cost + weightTerm;

% calculate the gradient of the cost with respect to theta
thetagrad = -1 * (numCases^-1) * (data * (groundTruth - h)')' + lambda*theta;

% Unroll the gradient matrices into a vector for minFunc
grad = [thetagrad(:)];

end

This function accepts the following parameters:

- numClasses: the number of classes
- inputSize: the size N of the input vector
- lambda: weight decay parameter
- data: the N x M input matrix, where each column data(:, i) corresponds to a single example
- labels: an M x 1 vector containing the labels corresponding to the input data

The instruction M = theta * data makes M contain the theta*x exponents for each class and each example. M = bsxfun(@minus, M, max(M, [], 1)) subtracts the maximum of each theta*x vector for each example so that we do not get large values that might cause overflow. expM = exp(M) computes the exponentials of the theta*x matrix. With normTerm = 1./sum(expM,1) we compute the normalizing term which makes the exponentials sum to one. We then compute the probabilities, where the two instructions

h = (repmat(normTerm',1,numClasses) .* expM')' and

probSum = probSum + groundTruth(j,i) * log(h(j,i))

are important.

Finally we compute the weight decay term by weightTerm = (lambda/2) * sum(sum(theta .* theta)), and the cost becomes:

    cost = -1*(numCases^-1) * probSum;

    cost = cost + weightTerm;

The last step is to calculate the gradient of the cost with respect to theta by:

thetagrad = -1 * (numCases^-1) * (data * (groundTruth - h)')' + lambda*theta


    Finally we unroll the gradients into vector form by: grad = [thetagrad(:)];

We are now ready to test our trained classifier's predictive capabilities on a separate test set containing our labeled images.

    softmaxX = permute(pooledFeaturesTest, [1 3 4 2]);

    softmaxX = reshape(softmaxX, numel(pooledFeaturesTest) / numTestImages, numTestImages);

    softmaxY = testLabels;

    [pred] = softmaxPredict(softmaxModel, softmaxX);

    acc = (pred(:) == softmaxY(:));

    acc = sum(acc) / size(acc, 1);

    fprintf('Accuracy: %2.3f%%\n', acc * 100);

    The softmaxX will represent the pooled test images and softmaxY will carry the labels

    associated with each test example. We pass into our softmaxPredict function the previously

    acquired classification model along with the test examples and it will return the predictions

    made on each example.

    softmaxPredict.m

    function [pred] = softmaxPredict(softmaxModel, data)

    theta = softmaxModel.optTheta;

    pred = zeros(1, size(data, 2));

    M = theta * data;

    M = bsxfun(@minus, M, max(M, [], 1));

    expM = exp(M);

    normTerm = 1./sum(expM,1);

    h = (repmat(normTerm',1,softmaxModel.numClasses) .* expM')';

    [y,i] = max(h);

    pred = i;

    end

By using the [y,i] = max(h) expression we effectively choose, for each example, the category with the highest predicted probability. Finally we return the vector containing the predicted category labels for each example.

Upon return we can easily compare the actual labels of the test examples with what our model has just predicted, and finally print the accuracy of our prediction:

    [pred] = softmaxPredict(softmaxModel, softmaxX);

    acc = (pred(:) == softmaxY(:));


    acc = sum(acc) / size(acc, 1);

    fprintf('Accuracy: %2.3f%%\n', acc * 100);

    e) Image database construction

I had three ways of collecting images during my research. The first one was very straightforward: I used the STL-10 dataset provided by a link on the web site of the UFLDL tutorial. The goal was to ensure that my implementation worked just as in the lecture notes.

After ensuring my implementation was working well, I decided to try the algorithm on a different dataset. My strategy was to randomly download a massive amount of images from the Internet, as implied in (Raina et al., 2007). I chose Flickr as my search engine for images. The dataset for the unsupervised sparse auto-encoder algorithm was compiled by searching for images with the keyword "life". I successfully downloaded about 2,000 unlabeled images in this general category. This dataset was used to create the activation features used later in the softmax classification task.

I have sampled images in four categories: car, cat, tiger, tree. I have divided each category into a training and a test set to eliminate any sampling bias. Each training set consisted of around 40-50 images in the given category. Each test set contained around 15-20 images. (This exemplifies how difficult it is to get labeled images, contrasted with simply downloading random images.)

By training and testing I was able to achieve an ~85% accuracy rate. If I used random weights instead of the optimized activation features in the softmax training, the prediction rate dropped to ~57%! It was clear that the sparse auto-encoder algorithm contributed a significant improvement to the prediction performance.

The third way to acquire images was to download images from the ImageNet database. I downloaded the 2011 Fall dataset, which consisted of around 14 million direct image URLs. I wrote a small Python script that I used to download randomly sampled images.

import requests
import numpy as np

imagesDownloaded = 0
index = 0
file_database = open("fall11_urls.txt", "r")
lines = file_database.readlines()
numOfLines = len(lines)
randomIndexes = np.random.permutation(range(numOfLines))
outFolder = "2011Fall"

while imagesDownloaded < numImagesToDownload:   # numImagesToDownload: hypothetical name for the target sample size
    # the loop body fetched the URL stored at lines[randomIndexes[index]] with requests,
    # saved the image into outFolder, and incremented index and imagesDownloaded

A second part of the script then inspects each downloaded file: JPEG files are renamed and kept, while files with an unknown or other type are deleted, and the counts are reported:

        os.rename(filePath, newPath)
        jpegCount = jpegCount + 1
    elif (imgType == "" or imgType == None):
        os.remove(filePath)
        noTypeCount = noTypeCount + 1
    else:
        os.remove(filePath)
        otherTypeCount = otherTypeCount + 1
    numFilesChecked = numFilesChecked + 1
    print numFilesChecked

print "Jpeg count: ", jpegCount
print "Other type: ", otherTypeCount
print "No type: ", noTypeCount

    f) Enhancements of the standard algorithm

I made several enhancements to the main algorithm proposed by the Stanford class, mainly to increase the memory efficiency of the learning process. In the following section I enumerate the changes I introduced into my implementation:

Compiling the raw images into batches and saving them in the standard MATLAB (.mat) format. With this feature one can work with the images in batch iterations, so it is not necessary to keep all the images in RAM at once. It enabled me to work with a much larger sample size.

Training the neural network with limited-memory batch processing. To be fair, this was a recommendation on the UFLDL site, but the process was not elaborated there in much detail. By training the network in batches one does not need to read all the input into RAM; the network is trained and optimized in successive iterations instead. On the one hand this is more efficient, because the load on RAM is lower and the weights converge faster; on the other hand it is less efficient, because it introduces a redundant step: the feedforward activations have to be computed twice in order to obtain the average activations needed for the back-propagation step. (A minimal sketch of this batch-wise set-up follows the list.)

Vectorized forms of all the computationally expensive algorithms. This makes the code more difficult to read and comprehend, but MATLAB can process vectorized forms very efficiently thanks to its sophisticated matrix manipulation libraries.


Testing the algorithm on real-life datasets. By this I mean that I simply downloaded images from the Internet, created categories and transformed the images in order to verify the validity of the algorithm.
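To make the batch-wise set-up above concrete, the following minimal sketch shows the mechanics only: the patches are first written out as .mat batches, then a plain one-hidden-layer auto-encoder is updated with a vectorized gradient step one batch at a time. The file names, the sizes, the random stand-in data and the bare squared-error objective (no sparsity or weight-decay terms) are assumptions of this sketch, not the actual thesis code; in the real implementation the simple gradient step is replaced by the optimization of the full sparse auto-encoder cost, but the load-one-batch-at-a-time structure is the same.

visibleSize = 64;          % 8 x 8 patches
hiddenSize  = 25;
numBatches  = 10;
batchSize   = 1000;

% 1) Compile the raw data into .mat batches so that only one batch has to
%    live in RAM at any time (random numbers stand in for real patches here).
for b = 1:numBatches
    patches = rand(visibleSize, batchSize);
    save(sprintf('patch_batch_%02d.mat', b), 'patches');
end

% 2) Train a plain auto-encoder batch by batch with a vectorized gradient step.
W1 = 0.01 * randn(hiddenSize, visibleSize);   b1 = zeros(hiddenSize, 1);
W2 = 0.01 * randn(visibleSize, hiddenSize);   b2 = zeros(visibleSize, 1);
alpha = 0.1;                                   % learning rate
sigmoid = @(z) 1 ./ (1 + exp(-z));

for epoch = 1:5
    for b = 1:numBatches
        S = load(sprintf('patch_batch_%02d.mat', b));
        x = S.patches;   m = size(x, 2);

        % vectorized feed-forward over the whole batch
        a2 = sigmoid(bsxfun(@plus, W1 * x, b1));
        a3 = sigmoid(bsxfun(@plus, W2 * a2, b2));

        % back-propagation of the squared reconstruction error
        d3 = (a3 - x) .* a3 .* (1 - a3);
        d2 = (W2' * d3) .* a2 .* (1 - a2);

        % gradient step on this batch only
        W2 = W2 - alpha * (d3 * a2') / m;   b2 = b2 - alpha * sum(d3, 2) / m;
        W1 = W1 - alpha * (d2 * x')  / m;   b1 = b1 - alpha * sum(d2, 2) / m;
    end
end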

    g) Experiments

After running the sparse auto-encoder on the patches sampled from STL-10 (100,000 patches of 8 x 8 pixels) I was able to produce feature activations visually very similar to the ones shown on the tutorial's web site.

Dataset   Number of categories   Method                                       Prediction Accuracy
STL-10    4                      Raw (random weights without pre-training)    ~58%
STL-10    4                      Sparse auto-encoder                          ~81%

The table above shows the effect of pre-training the weights with the unsupervised learning step: we can achieve much better classification performance if we support the supervised learning task by first running an unsupervised algorithm.

My experience with datasets other than STL-10 reinforced the validity of the self-taught learning method, even though I was never able to achieve a classification accuracy better than 85%.


Data set         Unlabeled set size     Labeled images   Categories                    Accuracy
www.flickr.com   2000                   200              4 (car, cat, tiger, tree)     85%
www.flickr.com   2000 (same as above)   400              2 (male face, female face)    72%
ImageNet         15000                  200              4 (car, cat, tiger, tree)     81%

The table above summarizes my experiments with custom datasets. I wanted to see whether the algorithm works well with images compiled by a user simply by downloading them at random. The algorithm differentiates significantly different objects well, for example cats from trees or trees from cars, but it has serious problems deciding between visually similar classes such as cats and tigers. I observed the same issue with the STL-10 dataset: the model could not reliably tell whether it saw a cat or another, non-cat mammal.
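One convenient way to make these mix-ups explicit is to accumulate the predictions into a confusion matrix. The sketch below reuses the pred and softmaxY vectors from the classification step and assumes the labels are the integers 1..numClasses; it is an illustration, not part of the thesis code:

numClasses = 4;                              % e.g. car, cat, tiger, tree
confMat = zeros(numClasses, numClasses);
for i = 1:numel(pred)
    confMat(softmaxY(i), pred(i)) = confMat(softmaxY(i), pred(i)) + 1;
end
disp(confMat);                               % row: true category, column: predicted category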

    4. Conclusion

Self-taught learning holds great promise for future research and for classification tasks where we do not have access to a plethora of labeled images, since it is much easier and cheaper to simply acquire large amounts of unlabeled images.

A further direction for my research will be to change the underlying architecture and its parameters. When I tried to use bigger datasets I was not able to show significantly better performance, so I suspect that one promising way forward is to use stacked auto-encoders. The idea is nicely presented on the UFLDL site [6].

My next step will therefore be to expand the algorithm with stacked auto-encoders. With only a single hidden layer one can only extract edges from the images, so the network effectively works as an edge detector. Inserting multiple hidden layers could extract deeper underlying features: deeper layers could detect longer contours, or perhaps simple "parts of objects", and an even deeper layer might then group these contours together or detect even more complex features [7].

[6] http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders
[7] http://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_Overview
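For reference, recent releases of the MathWorks Neural Network Toolbox ship ready-made building blocks for exactly this greedy, layer-wise construction. The sketch below only illustrates the idea with those built-ins (trainAutoencoder, encode, trainSoftmaxLayer, stack) on random placeholder data; it is not the hand-written implementation used in this thesis, and the layer sizes are arbitrary.

% Greedy layer-wise pre-training of two auto-encoder layers followed by a
% softmax layer and supervised fine-tuning, on random placeholder data.
X      = rand(64, 500);                    % 500 samples of 8 x 8 patches
labels = randi(4, 1, 500);                 % four categories
T      = full(ind2vec(labels));            % one-hot target matrix

autoenc1 = trainAutoencoder(X, 25, 'MaxEpochs', 100);
feat1    = encode(autoenc1, X);            % activations of the first hidden layer

autoenc2 = trainAutoencoder(feat1, 12, 'MaxEpochs', 100, 'ScaleData', false);
feat2    = encode(autoenc2, feat1);        % deeper, more abstract features

softnet  = trainSoftmaxLayer(feat2, T, 'MaxEpochs', 100);

deepnet  = stack(autoenc1, autoenc2, softnet);   % the full deep network
deepnet  = train(deepnet, X, T);                 % supervised fine-tuning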

    5. Acknowledgments

In this section I would like to express my gratitude to my supervisor, Bálint Antal, who guided me through my research and introduced me to the idea of self-taught learning. His help and guidance greatly determined the direction and the success of my research in this topic.

I would also like to thank Andrew Ng for his great lectures at Coursera, where I was first exposed to the ideas and techniques of machine learning in general. His Machine Learning course, freely available on YouTube, and his handouts greatly helped me learn the fundamentals of machine learning. His UFLDL tutorial on deep learning, together with its implementation tips, should be the standard of education at every institute.


    References

Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, Andrew Y. Ng, Self-taught Learning: Transfer Learning from Unlabeled Data, 2007

Bruno A. Olshausen, David J. Field, Sparse coding of sensory inputs, Current Opinion in Neurobiology, 14(4):481-487, 2004

Honglak Lee, Alexis Battle, Rajat Raina, Andrew Y. Ng, Efficient sparse coding algorithms, 2006

Yoshua Bengio, Aaron Courville, Pascal Vincent, Representation Learning: A Review and New Perspectives, 2012

Tom M. Mitchell, Machine Learning, March 1, 1997

Ethem Alpaydın, Introduction to Machine Learning, 2nd ed., The MIT Press, Cambridge, Massachusetts / London, England, 2010

G. E. Hinton, Supervised learning in multilayer neural networks, in The MIT Encyclopedia of the Cognitive Sciences, editors: Robert A. Wilson and Frank C. Keil, The MIT Press, 1999

D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating errors, Nature, 323:533-536, 1986

K. Nigam, A. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM, Machine Learning, 39:103-134, 2000

B. A. Olshausen, D. J. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature, 381:607-609, 1996

B. A. Olshausen, D. J. Field, Sparse coding with an overcomplete basis set: A strategy employed by V1?, Vision Research, 37:3311-3325, 1997

M. S. Lewicki, T. J. Sejnowski, Learning overcomplete representations, Neural Computation, 12(2), 2000

B. A. Olshausen, Sparse coding of time-varying natural images, Journal of Vision, 2(7):130, 2002

MathWorks Inc., Neural Network Product Help


    Sources from the Internet:

http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Neural_Networks (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Self-Taught_Learning (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Linear_Decoders (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolution (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Pooling (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_Overview (accessed 5 November 2013)