Self-taught learning: Implementation using MATLAB


Self-taught learning is a new framework in the domain of machine learning. Its potential lies in using unsupervised learning strategies to automatically learn and extract information and consequently create a succinct representation of the given input. I present the theoretical underpinnings of the idea and then proceed to show a possible implementation of the algorithm in the MATLAB programming language. I have also included my experimental results obtained during my research. Neural networks are used extensively in the project.


THESIS

Bod Gergely

Debrecen

2013

University of Debrecen

Faculty of Informatics

Self-taught learning:

Implementation using MATLAB

Supervisor: Dr. Antal Bálint        Author: Bod Gergely

Position: Assistant professor       Degree programme: Software engineering (programtervező informatikus)

Table of contents

    1. Machine learning primer

    a) Supervised versus Unsupervised learning

    b) Neural networks

    2. Self-taught learning

    a) Introduction

    b) Representation learning

    3. Implementation

    a) Initialization

    b) Setting up the environment for learning

    c) Feature learning

    d) Classification

    e) Image database construction

    f) Enhancements of the standard algorithm

    g) Experiments

    4. Conclusion

    5. Acknowledgments


    1. Machine learning primer

In this thesis I would like to present a relatively new concept that has emerged from machine learning. Before going into specific details regarding this new framework, it is advisable to spend some time elaborating on what the field of machine learning deals with. Machine learning is universally defined as the construction of intelligent computer programs that can learn and with time improve their performance on some task (Mitchell, 1997).

Machine learning is often categorized as a branch of computer science and is usually considered a subfield of artificial intelligence. While this categorization places machine learning strictly into the domain of computing, it is fair to mention other disciplines that have been equally if not more influential in the development of this science. First of all, machine learning employs a plethora of methods and technical tools from statistics, mathematics, neuroscience, and biological systems. One particular subfield of mathematics, namely optimization, is a very important tool with broad applications in machine learning. Statistics shares many concepts, tools, and methods with machine learning, but there seems to be one distinctive feature: while statistics emphasizes inference, the primary goal of machine learning is prediction.

Typical industrial applications of ML include spam filtering, handwritten character recognition, and image classification. To give a gentle example, consider the following from Alpaydın: "For some tasks, however, we do not have an algorithm; for example, to tell spam emails from legitimate emails. We know what the input is: an email document that in the simplest case is a file of characters. We know what the output should be: a yes/no output indicating whether the message is spam or not. We do not know how to transform the input to the output. What can be considered spam changes in time and from individual to individual. What we lack in knowledge, we make up for in data. We can easily compile thousands of example messages some of which we know to be spam and what we want is to learn what constitutes spam from them. In other words, we would like the computer (machine) to extract automatically the algorithm for this task. There is no need to learn to sort numbers, we already have algorithms for that; but there are many applications for which we do not have an algorithm but do have example data... Think, for example, of a supermarket chain that has hundreds of stores all over a country selling thousands of goods to millions of customers... What the supermarket chain wants is to be able to predict who are the likely customers for a product. Again, the algorithm for this is not evident; it changes in time and by geographic location."1

The previous citation gives a good understanding of when machine learning can be employed. We do not have a rigid algorithm, but we have a massive amount of data. This data can compensate us for not having an exact algorithm, and by discovering certain statistical and mathematical patterns in the data we can construct different learning strategies.

    a) Supervised versus Unsupervised learning

There is a particular dimension along which we can differentiate learning algorithms: the absence or presence of labels provided with each incoming example. Imagine a dataset which consists of images of cars and cats. The individual images have labels attached to them describing which category (car or cat) they belong to. We could train the system by splitting the dataset into two parts: a) a training set and b) a test set. By first running the algorithm on the training set, the algorithm can make certain corrections to the parameters of the underlying model whenever it sees an example and is told what category it belongs to. This way it is possible to devise a system that can predict, with a certain statistical accuracy, which category a fresh example (one that has not been observed by the algorithm before) belongs to.

In contrast, when employing unsupervised learning we do not provide the labels (concretely, the metadata about the category) of the examples. In this case we are trying to find some hidden pattern in the underlying data. It is very well summarized by Alpaydın: "In supervised learning, the aim is to learn a mapping from the input to an output whose correct values are provided by a supervisor. In unsupervised learning, there is no such supervisor and we only have input data. The aim is to find the regularities in the input. There is a structure to the input space such that certain patterns occur more often than others, and we want to see what generally happens and what does not. In statistics, this is called density estimation."2

1 Ethem Alpaydın, Introduction to Machine Learning, 2nd ed., The MIT Press, Cambridge, Massachusetts; London, England, 2010, pages 1-2.

2 Ethem Alpaydın, Introduction to Machine Learning, 2nd ed., The MIT Press, Cambridge, Massachusetts; London, England, 2010, page 11.


    b) Neural networks

Since the algorithm devised by Andrew Ng at Stanford University relies heavily on the use of neural networks (I would even say that they are the workhorse of the algorithm), I feel the need to spend some time explaining what they are and when they are useful. The idea of a neural network was inspired by a biological construct, namely the brain. The brain consists of millions of neurons that form a very complex information processing and forwarding system.

    Neural networks consist of simple processing units that interact via weighted connections.

    They are sometimes designed in hardware but most research nowadays involves software

    simulations. They were originally inspired by ideas about how the brain makes its

    computations (Hinton, 1999).

    A typical processing unit first computes a total input which is a weighted sum of the

    incoming values from other units plus a bias term. The next step is to put its total input

through an activation function to calculate the activity of the unit. One of the most common activation functions is the sigmoid, where y = 1/(1+exp(-x)) (Hinton, 1999).
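As a minimal illustration (not part of the thesis code; the numeric values below are made up), the computation of a single sigmoid unit in MATLAB could look like this:

p = [0.5; -1.2; 0.3];      % hypothetical input values coming from other units
w = [0.8; 0.1; -0.4];      % hypothetical weights of the incoming connections
b = 0.05;                  % bias term
n = w' * p + b;            % total input: weighted sum of the inputs plus the bias
a = 1 ./ (1 + exp(-n));    % activity of the unit after the sigmoid activation function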

    One of the most interesting properties of neural networks is their ability to learn from

    examples by adapting the weights on the connections. The most widely adopted machine

    learning algorithms are supervised: they assume that there is a set of training examples, each

consisting of an input vector and a desired output vector. Learning involves sweeping through the training set a couple of times and gradually adjusting the weights so that the actual output

    produced by the network gets closer to the desired output. The simplest neural network

    architecture is built of some input units with directed, weighted connections to an output unit.

    (Hinton, 1999)


By the introduction of so-called hidden layers, neural networks can express complicated nonlinear mappings between the input and the output. Finding the optimal weights is generally computationally intractable, but gradient methods can be used effectively to find sets of weights that work well for many practical, real-life tasks.

An algorithm called backpropagation (Rumelhart et al., 1986) can be used to compute the derivatives of the error function with respect to each weight in the network. The standard error function is the squared difference between the actual and the desired outputs (Hinton, 1999).

    Source: MathWorks, Neural Network Product help

    Illustration B: Describes a single neuron with input p and weight w.

    A. Illustration

    Source: The MathWorks Inc, Neural Network Product help


    For each training case the activities of the units are calculated by a forward pass through the

    network. Then starting with the output units a backward pass is done through the network to

    compute the derivatives of the error function with respect to the total input received by each

    unit (Hinton, 1999).
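To make the weight adjustment concrete, a single gradient-descent step can be sketched as follows (the sizes, the random values and the learning rate eta are hypothetical; dEdW stands for the derivatives produced by the backward pass):

W    = randn(3, 2);        % hypothetical weight matrix
dEdW = randn(3, 2);        % derivatives of the error w.r.t. each weight (placeholder values)
eta  = 0.01;               % learning rate
W    = W - eta * dEdW;     % move every weight a small step against its gradient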

    Below one can see a graphical representation of the one-layer neural network architecture.

    C. Illustration

    Source: MathWorks, Neural Network Product help

    Illustration C depicts a one-layer network with R input elements and S neurons.

    In this network, each element of the input vector p is connected to each neuron input through

    the weight matrix W. The ith neuron has a summer function that gathers its weighted inputs

    and bias to form its own scalar output, n(i). The various n(i) taken together form an S-element

    net input vector, n. Finally, the neuron layer outputs form a column vector a. The expression

    for a is shown at the bottom of the picture. (MathWorks, Neural Network Product help)

    Note that it is not uncommon for the number of inputs to a layer to be different from the

    number of neurons (i.e., R is not necessarily equal to S). A layer is not limited to have the

    number of its inputs equal to the number of its neurons. (MathWorks, Neural Network

    Product help)
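As a minimal MATLAB sketch of the expression a = f(Wp + b) for such a layer (the sizes R = 3, S = 2 and the random values are hypothetical):

R = 3; S = 2;              % number of input elements and number of neurons
p = rand(R, 1);            % input vector
W = randn(S, R);           % weight matrix: one row of weights per neuron
b = randn(S, 1);           % bias vector
n = W * p + b;             % S-element net input vector
a = 1 ./ (1 + exp(-n));    % layer output with the sigmoid transfer function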

A network is not constrained to one hidden layer. One can stack multiple layers on top of each other, where each layer's input is the output of the previous one.

    Source: MathWorks, Neural Network Product help

    Illustration D: Matrix (IW1,1) having a source 1 (second index) and a destination 1 (first

    index). Elements of layer 1, such as its bias, net input, and output have a superscript 1 to say

    that they are associated with the first layer.

    2. Self-taught learning

In this chapter I will introduce the idea of a new framework postulated by Raina et al. in their paper (Raina et al., 2007).

    a) Introduction

    This new framework has the potential to use unlabeled data to enhance supervised

    classification tasks. They (Raina et al., 2007) do not assume that the unlabeled data follows

    the same class labels or distribution as the labeled dataset. Consequently they were able to use

    a large number of unlabeled images or other input types like audio or text that are downloaded

    randomly from the Internet. This huge unlabeled dataset is to be used to enhance the

    performance of the supervised classification task. Since such unlabeled data is much easier


and cheaper to obtain than labeled data, it is possible to apply self-taught learning to a broad range of practical problems. Their approach uses sparse coding to construct higher-

    level features using the unlabeled data. These features form a compact input representation

    and in theory can significantly improve the classification performance. (Raina et al., 2007)

    Their approach is motivated by the observation that even many randomly downloaded images

    will contain basic visual patterns (such as edges) that are similar to the images to be originally

classified. Therefore, we can learn to recognize such patterns from the unlabeled data, and these patterns can be used for the supervised learning task of interest. (Raina et al., 2007)

They distinguish their approach from the earlier method of semi-supervised learning by stating that the unlabeled data does not share the class labels or the generative distribution of the labeled data. "For example, given unlimited access to natural sounds (audio), can we perform better speaker identification?"3 (Raina et al., 2007)

    The similarity with semi-supervised learning (Nigam et al., 2000) is that both use labeled and

unlabeled data for the classification task. But unlike semi-supervised learning, the new method they propose does not assume that the unlabeled data can be assigned to the supervised learning task's class labels.3

The consequence is that acquiring unlabeled images is far easier than acquiring the same amount of labeled data for a certain category, since it is perfectly sufficient to randomly download, for example, 100,000 images from the Internet.

Their argument also enjoys a biological motivation: it has long been assumed by neuroscientists that most human learning is performed in an unsupervised fashion.

Their approach is split into the following two stages: first, we learn a representation using

    only unlabeled data. Then, we apply this representation to the labeled data, and use it for the

    classification task. Once the representation has been learned in the first stage it can then be

    applied repeatedly to different classification tasks.3

    b) Representation learning

Many of our machine learning tasks are hindered by the fact that it is often very difficult to recognize the underlying factors that contribute the most explanatory power. Nowadays a common method is to manually preprocess the data by discovering the most fundamental factors and to transform the data so that a traditional classifier can solve the problem at hand effectively, but this often requires industry-specific knowledge. Feature engineering like this is important, but it also has the weakness of being labor intensive and difficult to apply in a general context (Bengio, 2012). Recent advances in machine learning concentrate on representation learning, which can broaden the applicability of our already known algorithms to classify/predict on the data. Self-taught learning holds promise by incorporating several methods from representation learning, like sparse auto-encoders and deep networks, automating the process of feature extraction and augmenting the final classification/prediction task.

3 Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, Andrew Y. Ng, Self-taught Learning: Transfer Learning from Unlabeled Data, 2007, pages 1-2.

    3. Implementation

At the beginning of my research I decided to start by implementing the proposed algorithms presented on the website of the Stanford class UFLDL (Unsupervised Feature Learning and Deep Learning)4, where some framework-related, mostly initialization code was available, and I have included it in my implementation. I have coded most of the algorithms both in MATLAB and in NumPy/SciPy (NumPy and SciPy are extensions to the Python language for numerical computations). My observation was that MATLAB provided me with a much better environment for rapid development and prototyping. The convenience of the debugger in MATLAB made a big impact on the speed of my algorithm development. Considering the above, I decided to provide the code snippets in MATLAB instead of Python (some of the utility scripts were nevertheless scripted in Python).

    a) Initialization

The MATLAB environment I have created consists of a top-level script file called run.m, whose job is to start the profiling of the whole algorithm and to kick off the several stages of the process. Let's take a look at the code:

    4 http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial


run.m

profile on;   % start the MATLAB profiler
start = tic;
linearDecoder;
cnn;
elapsed = toc(start);
fprintf('Elapsed time: %f seconds\n', elapsed);
fprintf('Done.\n');

By calling the tic function we record the current time in the start variable. After the algorithm has finished, the toc function is called with the start variable, and the result is stored in the elapsed variable, which we use to print the elapsed number of seconds.

    b) Setting up the environment for learning

At this point we should step into the linearDecoder.m script file, which performs several tasks. Specifically, it needs to:

- Initialize the architecture and the task-specific parameters
- Optionally create the image batches
- Apply ZCA whitening
- Define the number of image patches to work with
- Randomly initialize the weights of the neural network
- Start the optimization of the neural network's cost function
- Save the optimal weights, which become the features we can use for representation
- Visualize the learned features

Now let's look at the parameter initialization section:

imageChannels = 3;     % number of channels (RGB, so 3)
patchDim = 8;          % patch dimension
numPatches = 100000;   % number of patches
visibleSize = patchDim * patchDim * imageChannels;   % number of input units
outputSize = visibleSize;   % number of output units
hiddenSize = 400;           % number of hidden units
sparsityParam = 0.035;      % desired average activation of the hidden units
lambda = 3e-3;              % weight decay parameter
beta = 5;                   % weight of sparsity penalty term
epsilon = 0.1;              % epsilon for ZCA whitening

I set the number of image channels to three since we will be processing colored (RGB) pictures, where an image is stored in a three-dimensional matrix in which one dimension holds the values of the red-green-blue components. The variable numPatches holds the number of patches we would like to extract from our unlabeled image dataset. I set the size of one patch to an 8 x 8 matrix by specifying patchDim to be eight. We need to take into account how much RAM our system has. By setting the number of patches to 100,000 we will be allocating a total of (8 x 8 x 3) x 100,000 double-precision values (8 bytes each), which is a total of ~153 MB.
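This estimate can be reproduced with a quick back-of-the-envelope calculation, assuming double precision (8 bytes per value):

bytesPerValue = 8;                                    % double precision
bytesNeeded = (8 * 8 * 3) * 100000 * bytesPerValue;   % 153,600,000 bytes
fprintf('Patch matrix size: %.1f MB\n', bytesNeeded / 1e6);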

In the next section we calculate the number of input and output units. In our case this is 192, since we have 8 x 8 x 3 = 192 incoming values for one image patch example. We set the number of hidden units to 400.

By setting the sparsity parameter to a number close to zero, we drive the neural network to keep most of its hidden units inactive, since we want only a small number of nodes to be active, essentially in firing mode. The lambda and beta parameters will have an impact on the behaviour of the neural network's cost function.

Let me proceed further in linearDecoder.m:

batchSize = 25000;   % should be less than the number of patches
patches = sampleIMAGES_fromBatches('C:\folder', numPatches);
save 'patches_imageNet.mat' patches

The batchSize variable tells the algorithm later in how many batches the optimization of the cost function should happen. This is required for limited-memory processing, since this way one does not need to load all the data into RAM. In our case we have 100,000 sample patches, and this enables us to divide the processing into four separate iterations, consequently putting a lot less strain on the main memory.

The first argument of the sampleIMAGES_fromBatches function is a string denoting the folder containing our saved batches of images. Let me first show the MATLAB code used to create these image batches.

saveImagesToFiles.m

function IMAGES = saveImagesToFiles(pathFromDir, saveDir, row, col)

numOfImagesInOneBatch = 500;
fileFolder = fullfile(pathFromDir);
dirOutput = dir(fullfile(fileFolder, '*.jpg'));
fileNames = {dirOutput.name}';
numOfImages = numel(fileNames);
I = imread(fileNames{1});

numOfBatches = floor(numOfImages / numOfImagesInOneBatch);
numOfRemainder = mod(numOfImages, numOfImagesInOneBatch);

fprintf('Number of images to save: %d\n', numOfImages);
fprintf('Number of images in one batch: %d\n', numOfImagesInOneBatch);
fprintf('Number of batches: %d\n', numOfBatches + 1);

numberOfImagesSaved = 0;

for n = 1:numOfBatches
    batchName = strcat('imageBatch_', num2str(n));
    % Preallocate the batch
    IMAGES = zeros([row col 3 numOfImagesInOneBatch], class(I));
    for i = 1:numOfImagesInOneBatch
        currentImage = (n-1)*numOfImagesInOneBatch + i;
        I = imread(fileNames{currentImage});
        if not(numel(size(I)) == 3)
            fprintf(fileNames{currentImage})
            continue
        end
        IMAGES(:,:,:,i) = I;
        numberOfImagesSaved = numberOfImagesSaved + 1;
        fprintf('Loaded image %d in batch %d. Global image number: %d\n', i, n, numberOfImagesSaved);
    end
    save(strcat(saveDir, '/', batchName, '.mat'), 'IMAGES');
    fprintf('Saved batch number %d out of %d\n', n, numOfBatches + 1);
    clear IMAGES;
end

if not(numOfRemainder == 0)
    batchName = strcat('imageBatch_', 'remainder_', num2str(numOfRemainder));
    % Preallocate the batch
    IMAGES = zeros([row col 3 numOfRemainder], class(I));
    for i = 1:numOfRemainder
        currentImage = numOfBatches*numOfImagesInOneBatch + i;
        I = imread(fileNames{currentImage});
        if not(numel(size(I)) == 3)
            fprintf(fileNames{currentImage})
            continue
        end
        IMAGES(:,:,:,i) = I;
        numberOfImagesSaved = numberOfImagesSaved + 1;
        fprintf('Loaded image %d in the remainder batch. Global image number: %d\n', i, numberOfImagesSaved);
    end
    save(strcat(saveDir, '/', batchName, '.mat'), 'IMAGES');
    fprintf('Saved the remainder batch with %d images in it.\n', numOfRemainder);
    clear IMAGES;
end

We pass the directory containing our original images as the first parameter. The second parameter specifies the folder to save the transformed and packaged images to. The row and col parameters denote the dimensions to which we would like to resize our images. We can specify in the function how many images we want to have in one batch file. There is always a trade-off when working with batches of data. The more batches we have, the less memory we need, but on the other hand many more I/O operations are required to move the data from disk to RAM, and the code readability suffers too. In case we have plenty of memory we should set this number higher, so that fewer I/O operations need to be performed, at the cost of higher pressure on the main memory.

    In the next steps we calculate the number of batches required to hold all the images with the

    given extension from the given folder. In the for loop we start putting together the batches by

    reading each image into memory by calling the imread built-in MATLAB function. The

dimensions of the batch matrices are the following: the number of rows, columns, color channels (three), and the number of images to be stored in one batch. As exception handling, I guard against invalid images that do not have a three-dimensional structure by simply skipping them and logging the name of the particular image.

After this short diversion let's get back to our function, sampleIMAGES_fromBatches. Here the first parameter denotes the folder where we store the images that we have packaged into batch files. We also tell it how many patches we want to create, which in our case is 100,000.

patchsize = 8;   % Use 8x8 patches
numpatches = numberOfPatchesToCreate;
numOfChannels = 3;

% Initialize patches with zeros.
patches = zeros(patchsize*patchsize*numOfChannels, numpatches);

fileFolder = fullfile(directoryOfBatches);
dirOutput = dir(fullfile(fileFolder, '*.mat'));
fileNames = {dirOutput.name}';
numOfBatches = numel(fileNames);

samplesCreated = 0;
for n = 1:numOfBatches
    load(fileNames{n});
    fprintf('Sampling from batch: %s\n', fileNames{n});
    numOfImagesInCurrentBatch = size(IMAGES, 4);
    batchSize = numberOfPatchesToCreate / numOfBatches;
    for i = 1:batchSize
        randImg = randi(numOfImagesInCurrentBatch);
        patchRowStart = randi(505);
        patchRowRange = patchRowStart:patchRowStart+patchsize-1;
        patchColumnStart = randi(505);
        patchColumnRange = patchColumnStart:patchColumnStart+patchsize-1;
        patchSample = zeros(patchsize*patchsize*numOfChannels, 1);
        for j = 1:numOfChannels
            patchSample((j-1)*patchsize*patchsize+1:(j*patchsize*patchsize), 1) = ...
                reshape(IMAGES(patchRowRange, patchColumnRange, j, randImg), patchsize*patchsize, 1);
        end
        patches(:, (n-1)*batchSize + i) = patchSample;
        samplesCreated = samplesCreated + 1;
    end
end
fprintf('Number of samples created: %d\n', samplesCreated);

In the above code excerpt we randomly select 8 x 8 x 3 dimensional patches that we are going to feed into our neural network. The return value will be a matrix containing the randomly sampled 100,000 patches, each consisting of 192 values.

Now going back to linearDecoder.m:

patches = sampleIMAGES_fromBatches('C:\folder', numPatches);
save 'patches_imageNet.mat' patches


    We now have the sampled patches (192 x 100,000) in the patches variable. For convenience it

    is recommended to save this matrix into a MATLAB file making it persistent on disk in case

    we want to rerun the algorithm.

    Proceed forward in linearDecoder.m:

    % Subtract mean patch

    meanPatch = mean(patches, 2);

    patches = bsxfun(@minus, patches, meanPatch);

    % Apply ZCA whitening

    sigma = patches * patches' / numPatches;

    [u, s, v] = svd(sigma);

    ZCAWhite = u * diag(1 ./ sqrt(diag(s) + epsilon)) * u';

    patches = ZCAWhite * patches;

In this step we normalize the patches by subtracting the mean patch from each patch and then applying ZCA whitening.
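As an optional sanity check (not part of the original script), the covariance of the whitened patches should be close to the identity matrix; assuming the patches and numPatches variables from above:

sigmaWhite = patches * patches' / numPatches;    % covariance after whitening
offDiag = sigmaWhite - diag(diag(sigmaWhite));   % keep only the off-diagonal entries
fprintf('Mean absolute off-diagonal covariance: %g\n', mean(abs(offDiag(:))));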

    c) Feature learning

Now we get to the exciting part. This section does the heavy lifting by optimizing the weights of the neural network to yield the feature activation vector. Let's go back to our code:

theta = initializeParameters(hiddenSize, visibleSize);

The initializeParameters function takes two arguments, namely the number of hidden units and the number of input/output units; these are among the most important characteristics of the architecture of any neural network.

initializeParameters.m

function theta = initializeParameters(hiddenSize, visibleSize)

%% Initialize parameters randomly based on layer sizes.
r = sqrt(6) / sqrt(hiddenSize + visibleSize + 1);   % we'll choose weights uniformly from the interval [-r, r]
W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
W2 = rand(visibleSize, hiddenSize) * 2 * r - r;
b1 = zeros(hiddenSize, 1);
b2 = zeros(visibleSize, 1);

% Convert weights and bias gradients to the vector form.
% This step will "unroll" (flatten and concatenate together) all
% parameters into a vector.
theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];

From the above code snippet we get a randomly initialized vector of size 154,192, where the hidden size is 400 and the visible size is 192 (the dimension of a single image patch).
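The size of theta can be verified directly from the layer sizes; it is simply the number of entries in W1, W2, b1 and b2:

hiddenSize = 400; visibleSize = 192;
numParams = hiddenSize*visibleSize + visibleSize*hiddenSize + hiddenSize + visibleSize
% numParams = 154192, matching the length of the unrolled theta vector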

    Now we are ready to look at the code where we are going to conduct the feature

    optimization step.

% Use minFunc to minimize the function
addpath minFunc/
options = struct;
options.Method = 'lbfgs';
options.maxIter = 200;
options.display = 'on';

[optTheta, cost] = minFunc( @(p) sparseAutoencoderLinearCostBatch(p, ...
                                 visibleSize, hiddenSize, ...
                                 lambda, sparsityParam, ...
                                 beta, patches, batchSize), ...
                            theta, options);

% Save the learned features and the preprocessing matrices
fprintf('Saving learned features and preprocessing matrices...\n');
save('imagenet_Features.mat', 'optTheta', 'ZCAWhite', 'meanPatch');
fprintf('Saved\n');


We register with MATLAB the path to our optimizer, which is provided on the Stanford class site5. This optimizer is capable of conducting a flavour of limited-memory BFGS optimization. When I experimented with one of MATLAB's built-in optimizers (fminunc) I had memory problems, since it was impossible to obtain the Hessian matrix of such a huge vector (154,192 free variables). A limited-memory BFGS optimizer approximates the Hessian matrix without building the whole matrix in memory. As a side note I would like to point out that SciPy provides the minimize function in its optimization library, which can do L-BFGS out of the box.

The minFunc optimizer takes a function handle to the cost function to be minimized, along with the variables (in this case the features, theta) and the options that govern the behaviour of the optimizer. The function will yield the optimized variables and the cost associated with those variable values.

After we have obtained those values, it is again recommended to save the optimal features to disk so they are available the next time for further experimentation.

Now we have reached one of the most important algorithms in this thesis, namely the cost function of the neural network that acts as an auto-encoder, whose task is to produce a good approximation of the input values by devising the optimal feature activations. It is time for code excerpts once again.

function [cost, grad] = sparseAutoencoderLinearCostBatch(theta, visibleSize, hiddenSize, lambda, sparsityParam, beta, data, batchSizeParam)

The above is the signature of the cost function, which upon completion returns the cost and the gradients that will be used by the optimizer.

Let me enumerate the parameters of this function:

- theta: the weight values to be used by the neural network (this is what is optimized)
- visibleSize: number of nodes in the input/output layer
- hiddenSize: number of nodes in the hidden layer
- lambda: weight decay parameter
- sparsityParam: provides an incentive for the average hidden unit activations to converge to this value
- beta: weight of the sparsity penalty
- data: the matrix containing the training data (in our case a 192 x 100,000 matrix)
- batchSizeParam: the number of examples to process in one batch

5 http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

    As a first step we distribute the variables in the theta vector among the weight parameters of

    the neural network.

W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

Let's proceed further:

    numOfExamples = size(data,2);

    batchSize = batchSizeParam;

    numOfBatches = numOfExamples/batchSize;

    hwb = zeros(visibleSize,batchSize);

    a2 = zeros(hiddenSize, batchSize);

    pj = zeros(hiddenSize,batchSize);

    summa = 0;

    batchStart = 1;

    batchEnd = batchSize;

    for t=1:numOfBatches

    %forward propagation optimized with vectorization

    z2 = W1 * data(:,batchStart:batchEnd) + repmat(b1,1,batchSize);

    a2 = sigmoid(z2);

    z3 = W2 * a2 + repmat(b2,1,batchSize);

    hwb = z3; % makes it a LINEAR DECODER!

    ... to be continued.

First we assign the number of examples we are providing to the variable numOfExamples. We assign the number of examples in one batch to the variable batchSize, which we chose to be 25,000. I found this to be a good number on the medium-priced PC where I conducted my experiments, since having a total of 100,000 examples then requires four iterations. We are faced with a trade-off again, as with any batch processing. MATLAB has very powerful linear algebra libraries that can exploit vectorized calculations very efficiently, which creates a strong incentive to pass a matrix with all the examples in it. But that might be impossible, or might put enormous pressure on the memory system. On the other hand, by using smaller batches of examples instead of one big batch, we can make the optimizer converge faster to the optimal value and also put less pressure on RAM. I found a batch size of 25,000 examples to be a good compromise.

The next step in our calculation is to start forward propagation. Using the weights between the input and the hidden layer we can determine the hidden unit activations a2 for each example in the given batch. Please note that here we are using vectorized notation, which is a first-line optimization method, practically necessary in MATLAB. Unfortunately, it makes the code harder to read, but the speed gains are worth it in almost every scenario. During rapid prototyping I used for loops, which are executed by the MATLAB interpreter and resulted in a serious degradation in performance.

We are using the sigmoid function to decide whether a node is active or not. The sigmoid function has the property that it maps 0 to 0.5, converges to +1 very rapidly as the input gets relatively large, and converges to zero as the input takes a relatively large negative value. The MATLAB code to compute the sigmoid function:

    function sigm = sigmoid(x)

    sigm = 1 ./ (1 + exp(-x));

    end

    Source: Mathworks, Product help

    z3 = W2 * a2 + repmat(b2,1,batchSize);

    hwb = z3; % makes it a LINEAR DECODER!

The next step in the calculation is to compute the activations of the output units, where the chosen method is to use a linear activation function instead of a sigmoid, since a linear function does not constrain the output values to the [0,1] interval as the sigmoid does. Specifically, our linear function is the identity function, which we apply to z3 and assign to the output unit variable, hwb, right away.

Let's proceed. Now we compute the squared error term of the cost function, where we penalize any large deviation of the output activation hwb from the original input variable, data. The incentive of the neural network thus becomes that the output should be equal to the input.

    for l=1:batchSize

    diff = hwb(:,l) - data(:,(t-1)*batchSize+l);

    summa = summa + (diff'*diff) * 0.5;

    end
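As an aside, the same per-batch error term could be accumulated without the inner loop; a vectorized equivalent (my suggestion, relying on the same hwb, data, summa, batchStart and batchEnd variables as the loop above) would be:

diffBatch = hwb - data(:, batchStart:batchEnd);    % deviation of the output from the input
summa = summa + 0.5 * sum(sum(diffBatch .^ 2));    % accumulate the squared error of the batch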

Back in the cost function, the next line adds the current activations to pj, which we use to accumulate the sum of the activations over all examples so that we can later calculate an average activation per node.

    pj = pj + a2;

The next two lines are simply housekeeping, where we advance the variables that mark the range of examples to be processed. This is also the end of the for loop that iterates through each batch in order to complete the feed-forward step.

    batchStart = batchStart + batchSize;

    batchEnd = batchEnd + batchSize;

    end % end of 'for t=1:numOfBatches'

As can be seen in the next line, we calculate the average activation of each hidden unit.

    pj = sum(pj,2) * numOfExamples^-1;

    Then we have our cost term by averaging over all of the examples:

    cost = summa / numOfExamples;

    In the next code section we can see how the weight decay term is calculated:

    W2Sum = sum(sum(W2 .* W2));

W1Sum = sum(sum(W1 .* W1));
cost = cost + (lambda/2) * (W1Sum + W2Sum);

The vectorized notation obscures the code again. The weight decay term acts as a regularization agent whose purpose is to avoid over-fitting. Let's say we have a cost function measuring the error term E(w), where w denotes the weight matrix. Then we use Emod(w) = E(w) + (lambda/2) * ||w||^2, where ||w||^2 is the sum of the squared weights. This has the effect of penalizing large weights.

    We should proceed now to the next step:

    p = sparsityParam;

    sumKL = sum(p*(log(p./pj)) + (1-p)*log((1-p) ./ (ones(hiddenSize,1)-pj)));

    cost = cost + beta * sumKL;

In the above code snippet we make use of our desired sparsity parameter for the activations. We effectively penalize any deviation from our chosen sparsity value and add that penalty to the cost function. The penalty is based on the Kullback-Leibler divergence, which measures the difference between two probability distributions: it tries to measure the information loss that occurs when we approximate or substitute one distribution with a different one. Here the two distributions are Bernoulli distributions with means p and pj, where p represents our desired sparsity and pj is the actual average activation.
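For reference, the per-unit penalty used above can be written as a small MATLAB function handle; this is an illustration only and mirrors the expression inside sumKL:

% KL divergence between Bernoulli distributions with means p and pj
klBernoulli = @(p, pj) p .* log(p ./ pj) + (1 - p) .* log((1 - p) ./ (1 - pj));
klBernoulli(0.035, 0.035)   % zero penalty when the average activation equals the target
klBernoulli(0.035, 0.2)     % the penalty grows as pj drifts away from the desired sparsity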

Finally we calculate the sparsity term that will be needed in the backpropagation step:

    sparsityTerm = beta*( (-1*p)./pj + (1-p)./(ones(hiddenSize,1)-pj) );

Now we have reached the point where we are ready to apply backpropagation:

% HERE CALCULATE IN A LOOP THE FEEDFORWARD ACTIVATIONS AGAIN!
% (W1grad, W2grad, b1grad and b2grad are gradient accumulators, zero-initialized before this loop)
batchStart = 1;
batchEnd = batchSize;
for t = 1:numOfBatches
    z2 = W1 * data(:,batchStart:batchEnd) + repmat(b1,1,batchSize);
    a2 = sigmoid(z2);
    z3 = W2 * a2 + repmat(b2,1,batchSize);
    hwb = z3;   % makes it a LINEAR DECODER!
    % with a linear decoder the derivative of the output activation is 1,
    % so no derivOutput factor is needed
    errorOutput = -1 * (data(:,batchStart:batchEnd) - hwb);
    derivHidden = a2 .* (ones(hiddenSize, batchSize) - a2);
    % add the sparsity term!
    errorHidden = (W2'*errorOutput + repmat(sparsityTerm, 1, batchSize)) .* derivHidden;
    W2grad = W2grad + errorOutput * a2';
    b2grad = b2grad + sum(errorOutput, 2);
    W1grad = W1grad + errorHidden * data(:,batchStart:batchEnd)';
    b1grad = b1grad + sum(errorHidden, 2);
    batchStart = batchStart + batchSize;
    batchEnd = batchEnd + batchSize;
end

We are required to calculate the feed-forward activations once again due to our batch processing solution. This incurs a performance penalty on our algorithm but makes it possible to operate on machines with limited memory resources. By applying backpropagation we have a very efficient way to calculate the derivatives of our cost function. We are effectively calculating the error terms between our activations and the desired target value, namely our original input, data.

    What we are left with is to perform an averaging of the gradients.

    W2grad = W2grad * (numOfExamples^-1);

    b2grad = b2grad * (numOfExamples^-1);

    W1grad = W1grad * (numOfExamples^-1);

    b1grad = b1grad * (numOfExamples^-1);

    % add the weight decay term to W2grad and W1grad

    W2grad = W2grad + lambda*W2;

    W1grad = W1grad + lambda*W1;

    Lastly we need to add the weight decay term to the gradients as shown above.

    Finally we unroll the weights and the bias values into a vector:

    grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
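Before handing the cost function to the optimizer it is common practice to compare the analytic gradient against a numerical finite-difference estimate on a tiny instance. The following is only a sketch of such a check (it is not part of the original scripts); it assumes the functions above are on the MATLAB path and that lambda, sparsityParam and beta are set as in linearDecoder.m, while the tiny sizes are chosen purely to keep the check fast:

visibleCheck = 8; hiddenCheck = 5; numExamplesCheck = 10;   % tiny hypothetical sizes
dataCheck  = randn(visibleCheck, numExamplesCheck);
thetaCheck = initializeParameters(hiddenCheck, visibleCheck);
costFun = @(t) sparseAutoencoderLinearCostBatch(t, visibleCheck, hiddenCheck, ...
    lambda, sparsityParam, beta, dataCheck, numExamplesCheck);
[~, analyticGrad] = costFun(thetaCheck);
numGrad = zeros(size(thetaCheck));
epsStep = 1e-4;
for k = 1:numel(thetaCheck)
    e = zeros(size(thetaCheck));
    e(k) = epsStep;
    % central difference approximation of the k-th partial derivative
    numGrad(k) = (costFun(thetaCheck + e) - costFun(thetaCheck - e)) / (2 * epsStep);
end
fprintf('Largest difference between the gradients: %g\n', max(abs(numGrad - analyticGrad)));

If the two gradients agree to several decimal places, the backpropagation code above can be trusted.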


By passing the previously described cost function to the L-BFGS optimizer, after several iterations we obtain the optimal weights and the associated cost value:

[optTheta, cost] = minFunc( @(p) sparseAutoencoderLinearCostBatch(p, ...
                                 visibleSize, hiddenSize, ...
                                 lambda, sparsityParam, ...
                                 beta, patches, batchSize), ...
                            theta, options);

    As a last step it is worthwhile and interesting to visually represent the learned features by:

    W = reshape(optTheta(1:visibleSize * hiddenSize), hiddenSize, visibleSize);

    b = optTheta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);

    displayColorNetwork( (W*ZCAWhite)');

Running the algorithm resulted in the picture shown below, in which one can easily spot the learned edges; it is simply an image representation of the optimal weights obtained by running the optimizer.


The previous picture is useful for debugging purposes as well. If most of the squares contain only blurred patterns and no clear edges are present, we should suspect that our algorithm is not working as it is supposed to.

    As a final step if one used a profiler during execution it is advisable to stop the profiler and

    save the results to identify possible bottlenecks in the code execution:

    profile viewer

    p = profile('info');

    profsave(p,'profile_results')

    d) Classification

    If we look back to our run.m top level script we can see that the next step in our processing is

    to step into the cnn.m file.

    As a first step we initialize the environment once again with our parameters regarding the

    architecture of the neural network, the dimension of the image patches, etc.

    Please remember that we have saved the optimal weights of the network along with the ZCA

    whitening information used at the end of our sparse auto-encoder algorithm:

    save('imagenet_Features.mat', 'optTheta', 'ZCAWhite', 'meanPatch');

At this point the next step is to load the data back into memory by issuing the following command:

    load imagenet_Features;

    This command will load into RAM the optimal features, the matrix used for ZCA whitening

    and the mean values for each input variable (in our case 192).

    Once again we reshape our incoming features into matrices representing the weights and the

    bias term by:

    W = reshape(optTheta(1:visibleSize * hiddenSize), hiddenSize, visibleSize);

    b = optTheta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);


The next step is to load our labeled dataset containing the training and the test data. These images will be used during our supervised classification task.

    load trainImages;

    load testImages;

Both trainImages and testImages are MATLAB format files containing raw colored pictures in matrix form. Obtaining labeled images for each category is a very expensive process; it is often very difficult to obtain massive amounts of labeled data per category. One of the ways I collected these labeled images was by going to www.flickr.com. One can search for images by typing the required category into the search field, and a reasonable number of pictures will be presented by the web site. Obviously the categorization by Flickr is far from perfect, and manual picking of images is necessary to make sure that we are indeed acquiring images of the correct category. I have devised four categories:

- Cars
- Cats
- Tigers
- Trees

By manually downloading a small number of samples from each category, one can easily divide them into training and test sets. As a preprocessing step it is necessary to transform the images into a uniform and rather small dimension (64 x 64) for computational reasons, which can be done with the following function:

function IMAGES = loadandResizeImages(pathToDir, row, col)

fileFolder = fullfile(pathToDir);
dirOutput = dir(fullfile(fileFolder, '*.jpg'));
fileNames = {dirOutput.name}';
numFrames = numel(fileNames);
I = imread(fileNames{1});

% Preallocate the array
IMAGES = zeros([row col 3 numFrames], class(I));

for i = 1:numFrames
    I = imread(fileNames{i});
    I = imresize(I, [row col]);
    imwrite(I, strcat(pathToDir, 'resized/', fileNames{i}(:,1:end-4), '.bmp'));
    fprintf('Resized and saved picture number: %d\n', i);
end

Once we have the resized images on disk, we should package them into MATLAB matrix form.

At this point we are ready to apply convolution and pooling on both the training and the test data. Convolution and pooling, among other things, exploit certain statistical similarities of neighboring parts of the image. This results in significant computational savings, since we can represent the incoming data more succinctly. In our case we convolve the learned 8 x 8 features with the large incoming labeled images. Pooling is very useful for statistical aggregation: it takes a large contiguous, non-overlapping part of the image and transforms it into a single value. It makes our algorithm much more computationally feasible.

For memory and performance reasons we conduct the convolution and pooling steps in several iterations over the features.
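To get a feel for the sizes involved, the following short calculation uses the 64 x 64 images and 8 x 8 patches described above together with a hypothetical pooling dimension of 19:

imageDim     = 64;                              % resized labeled images
patchDim     = 8;                               % size of the learned features
convolvedDim = imageDim - patchDim + 1;         % 57 x 57 responses per feature and image
poolDim      = 19;                              % hypothetical pooling region size
pooledDim    = floor(convolvedDim / poolDim);   % 3 x 3 pooled values per feature and image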

Let's look at the signature of the convolution function:

function convolvedFeatures = cnnConvolve(patchDim, numFeatures, images, W, b, ZCAWhite, meanPatch)

In the next few instructions we allocate space for our variables and for the final convolved features.

numImages = size(images, 4);
imageDim = size(images, 1);
imageChannels = size(images, 3);
convolvedFeatures = zeros(numFeatures, numImages, imageDim - patchDim + 1, imageDim - patchDim + 1);

We need to account for the ZCA whitening preprocessing once again; instead of whitening every image patch explicitly, the whitening matrix and the mean patch are folded into the weights and the bias:

    WT = W*ZCAWhite;

    bT = b - WT * meanPatch;

    WT = reshape(WT, numFeatures, patchDim*patchDim, imageChannels);


    Then we conduct convolution on each image and each feature and each channel. A convolved

    image is created by executing the following instructions:

    feature = flipud(fliplr(squeeze(feature)));

    im = squeeze(images(:, :, channel, imageNum));

    convolvedImage = convolvedImage + conv2(im, feature, 'valid');

    Then we apply the sigmoid function in order to produce the activations:

    convolvedImage = sigmoid(convolvedImage);

    The pooling is performed by the following function:

cnnPool.m

function pooledFeatures = cnnPool(poolDim, convolvedFeatures)

numImages = size(convolvedFeatures, 2);
numFeatures = size(convolvedFeatures, 1);
convolvedDim = size(convolvedFeatures, 3);

pooledFeatures = zeros(numFeatures, numImages, floor(convolvedDim / poolDim), floor(convolvedDim / poolDim));
pool_length = floor(convolvedDim / poolDim);

rowbegin = 0;
rowend = 0;
columnbegin = 0;
columnend = 0;

for i = 1:numFeatures
    for j = 1:numImages
        for r = 1:pool_length
            for c = 1:pool_length
                rowbegin = 1 + poolDim * (r-1);
                rowend = poolDim * r;
                columnbegin = 1 + poolDim * (c-1);
                columnend = poolDim * c;
                pooledFeatures(i, j, r, c) = ...
                    mean(mean(convolvedFeatures(i, j, rowbegin:rowend, columnbegin:columnend)));
            end
        end
    end
end


We pool the features over poolDim x poolDim regions, producing a matrix with dimensions numFeatures x numImages x floor(convolvedDim/poolDim) x floor(convolvedDim/poolDim), where a single value within this matrix corresponds to a single pooling region.
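A toy call of cnnPool (with the 400 learned features, a single image, the 57 x 57 convolved responses implied by the sizes above, and the hypothetical pooling dimension of 19) would look like:

convolved = rand(400, 1, 57, 57);   % numFeatures x numImages x convolvedDim x convolvedDim
pooled = cnnPool(19, convolved);    % average pooling over 19 x 19 regions
size(pooled)                        % returns 400 x 1 x 3 x 3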

After we are done with convolving and pooling all the features of each image, it is again recommended to save the result by:

    save('pooledFeatures.mat', 'pooledFeaturesTrain', 'pooledFeaturesTest');

    Now we are presented with a computationally feasible supervised classification task. We are

    going to use the pooledFeaturesTrain and pooledFeaturesTest to train a softmax classifier.

softmaxLambda = 1e-4;
numClasses = 4; %2

% Reshape the pooledFeatures to form an input vector for softmax
softmaxX = permute(pooledFeaturesTrain, [1 3 4 2]);
softmaxX = reshape(softmaxX, numel(pooledFeaturesTrain) / numTrainImages, numTrainImages);
softmaxY = trainLabels;

options = struct;
options.maxIter = 200;
softmaxModel = softmaxTrain(numel(pooledFeaturesTrain) / numTrainImages, ...
    numClasses, softmaxLambda, softmaxX, softmaxY, options);

Let's open the softmaxTrain function to see how it operates exactly.

softmaxTrain.m

function [softmaxModel] = softmaxTrain(inputSize, numClasses, lambda, inputData, labels, options)

if ~exist('options', 'var')
    options = struct;
end
if ~isfield(options, 'maxIter')
    options.maxIter = 400;
end

% initialize parameters
theta = 0.005 * randn(numClasses * inputSize, 1);

addpath minFunc/
options.Method = 'lbfgs';
minFuncOptions.display = 'on';

[softmaxOptTheta, cost] = minFunc( @(p) softmaxCost(p, ...
                                        numClasses, inputSize, lambda, ...
                                        inputData, labels), ...
                                   theta, options);

softmaxModel.optTheta = reshape(softmaxOptTheta, numClasses, inputSize);
softmaxModel.inputSize = inputSize;
softmaxModel.numClasses = numClasses;

In case we did not provide the maximum iteration number in the options structure, we perform 400 iterations. Then we randomly initialize the variable to be optimized, theta. We specify the optimizer to perform an L-BFGS search and we pass the function softmaxCost to the optimizer. After we are finished with the optimization we save the results into a structure. But let's look into the cost function to gain some understanding of the softmax classifier:

softmaxCost.m

function [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels)

theta = reshape(theta, numClasses, inputSize);
numCases = size(data, 2);
groundTruth = full(sparse(labels, 1:numCases, 1));
cost = 0;
thetagrad = zeros(numClasses, inputSize);

M = theta * data;
M = bsxfun(@minus, M, max(M, [], 1));
expM = exp(M);
normTerm = 1 ./ sum(expM, 1);
h = (repmat(normTerm', 1, numClasses) .* expM')';

probSum = 0;
for i = 1:numCases
    for j = 1:numClasses
        probSum = probSum + groundTruth(j,i) * log(h(j,i));
    end
end
cost = -1 * (numCases^-1) * probSum;

weightTerm = (lambda/2) * sum(sum(theta .* theta));
cost = cost + weightTerm;

% calculate the gradient of the cost with respect to theta
thetagrad = -1 * (numCases^-1) * (data * (groundTruth - h)')' + lambda*theta;

% Unroll the gradient matrices into a vector for minFunc
grad = [thetagrad(:)];

end

This function accepts the following parameters:

- numClasses: the number of classes
- inputSize: the size N of the input vector
- lambda: weight decay parameter
- data: the N x M input matrix, where each column data(:, i) corresponds to a single example
- labels: an M x 1 vector containing the labels corresponding to the input data

The instruction M = theta * data makes M contain the theta*x exponents for each class and each example. M = bsxfun(@minus, M, max(M, [], 1)) subtracts the maximum of each theta*x vector for each example so that we do not get large values that might cause overflow. expM = exp(M) computes the exponentials of the theta*x matrix. With normTerm = 1./sum(expM,1) we compute the normalizing term which makes the exponentials sum to one. We then compute the probabilities, where the two instructions

h = (repmat(normTerm',1,numClasses) .* expM')' and

probSum = probSum + groundTruth(j,i) * log(h(j,i))

are important.

Finally we compute the weight decay term by weightTerm = (lambda/2) * sum(sum(theta .* theta)), and the cost becomes:

    cost = -1*(numCases^-1) * probSum;

    cost = cost + weightTerm;

The last step is to calculate the gradient of the cost with respect to theta by:

thetagrad = -1 * (numCases^-1) * (data * (groundTruth - h)')' + lambda*theta


    Finally we unroll the gradients into vector form by: grad = [thetagrad(:)];

We are now ready to test our trained classifier's predictive capabilities on a separate test set containing our labeled images.

    softmaxX = permute(pooledFeaturesTest, [1 3 4 2]);

    softmaxX = reshape(softmaxX, numel(pooledFeaturesTest) / numTestImages, numTestImages);

    softmaxY = testLabels;

    [pred] = softmaxPredict(softmaxModel, softmaxX);

    acc = (pred(:) == softmaxY(:));

    acc = sum(acc) / size(acc, 1);

    fprintf('Accuracy: %2.3f%%\n', acc * 100);

    The softmaxX will represent the pooled test images and softmaxY will carry the labels

    associated with each test example. We pass into our softmaxPredict function the previously

    acquired classification model along with the test examples and it will return the predictions

    made on each example.

    softmaxPredict.m

    function [pred] = softmaxPredict(softmaxModel, data)

    theta = softmaxModel.optTheta;

    pred = zeros(1, size(data, 2));

    M = theta * data;

    M = bsxfun(@minus, M, max(M, [], 1));

    expM = exp(M);

    normTerm = 1./sum(expM,1);

    h = (repmat(normTerm',1,softmaxModel.numClasses) .* expM')';

    [y,i] = max(h);

    pred = i;

    end

By using the [y,i] = max(h) expression we effectively choose, for each example, the category with the highest predicted probability. Finally we return the vector containing the predicted category labels for each example.

Upon return we can easily compare the actual labels of the test examples with what our model has just predicted, and finally print the accuracy of our prediction:

    [pred] = softmaxPredict(softmaxModel, softmaxX);

    acc = (pred(:) == softmaxY(:));


    acc = sum(acc) / size(acc, 1);

    fprintf('Accuracy: %2.3f%%\n', acc * 100);

    e) Image database construction

I had three ways of collecting images during my research. The first one was very straightforward: I used the STL-10 dataset provided by a link on the web site of the UFLDL tutorial. The goal was to ensure that my implementation worked just as in the lecture notes.

After ensuring my implementation was working well, I decided to try the algorithm on a different dataset. My strategy was to randomly download a massive amount of images from the Internet, as implied in (Raina et al., 2007). I chose Flickr as my search engine for images. The dataset for the unsupervised sparse auto-encoder algorithm was compiled by searching for images with the keyword "life". I successfully downloaded about 2,000 unlabeled images in this general category. This dataset was used to create the activation features used later in the softmax classification task.

I have sampled images in four categories: car, cat, tiger, tree. I have divided each category into a training and a test set to eliminate any sampling bias. Each training set consisted of around 40-50 images in the given category. Each test set contained around 15-20 images. (This exemplifies how difficult it is to get labeled images, contrasted with simply downloading random images.)

By training and testing I was able to achieve an ~85% accuracy rate. If I used random weights instead of the optimized activation features in the softmax training, the prediction rate dropped to ~57%! It was clear that the sparse auto-encoder algorithm contributed a significant improvement to the prediction performance.

The third way to acquire images was to download images from the ImageNet database. I downloaded the 2011 Fall dataset, which consisted of around 14 million direct image URLs. I wrote a small Python script that I used to download randomly sampled images.

import requests
import numpy as np

imagesDownloaded = 0
index = 0
file_database = open("fall11_urls.txt", "r")
lines = file_database.readlines()
numOfLines = len(lines)
randomIndexes = np.random.permutation(range(numOfLines))
outFolder = "2011Fall"

while imagesDownloaded < numImagesToDownload:   # numImagesToDownload: hypothetical name for the target sample size
    # the loop body fetched the URL stored at lines[randomIndexes[index]] with requests,
    # saved the image into outFolder, and incremented index and imagesDownloaded

A second part of the script then inspects each downloaded file: JPEG files are renamed and kept, while files with an unknown or other type are deleted, and the counts are reported:

        os.rename(filePath, newPath)
        jpegCount = jpegCount + 1
    elif (imgType == "" or imgType == None):
        os.remove(filePath)
        noTypeCount = noTypeCount + 1
    else:
        os.remove(filePath)
        otherTypeCount = otherTypeCount + 1
    numFilesChecked = numFilesChecked + 1
    print numFilesChecked

print "Jpeg count: ", jpegCount
print "Other type: ", otherTypeCount
print "No type: ", noTypeCount

    f) Enhancements of the standard algorithm

I made several enhancements to the main algorithm proposed by the Stanford class, mainly to increase the memory efficiency of the learning process. In the following section I enumerate the changes I introduced into my implementation:

Compiling the raw images into batches and saving them in the standard MATLAB (.mat) format. With this feature one can work with the images in batch iterations, so it is not necessary to keep all the images in RAM at once. It enabled me to work with a much larger sample size.

Training the neural network with limited-memory batch processing. To be fair, this was a recommendation on the UFLDL site, but the process was not elaborated there in much detail. By training the network in batches one does not need to read all the input into RAM; the network is trained and optimized in successive iterations instead. On the one hand this is more efficient, because the load on RAM is lower and the weights converge faster; on the other hand it is less efficient, because it introduces a redundant step: the feedforward activations have to be computed twice in order to obtain the average activations needed for the back-propagation step. (A minimal sketch of this batch-wise set-up follows the list.)

Vectorized forms of all the computationally expensive algorithms. This makes the code more difficult to read and comprehend, but MATLAB can process vectorized forms very efficiently thanks to its sophisticated matrix manipulation libraries.


Testing the algorithm on real-life datasets. By this I mean that I simply downloaded images from the Internet, created categories and transformed the images in order to verify the validity of the algorithm.
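To make the batch-wise set-up above concrete, the following minimal sketch shows the mechanics only: the patches are first written out as .mat batches, then a plain one-hidden-layer auto-encoder is updated with a vectorized gradient step one batch at a time. The file names, the sizes, the random stand-in data and the bare squared-error objective (no sparsity or weight-decay terms) are assumptions of this sketch, not the actual thesis code; in the real implementation the simple gradient step is replaced by the optimization of the full sparse auto-encoder cost, but the load-one-batch-at-a-time structure is the same.

visibleSize = 64;          % 8 x 8 patches
hiddenSize  = 25;
numBatches  = 10;
batchSize   = 1000;

% 1) Compile the raw data into .mat batches so that only one batch has to
%    live in RAM at any time (random numbers stand in for real patches here).
for b = 1:numBatches
    patches = rand(visibleSize, batchSize);
    save(sprintf('patch_batch_%02d.mat', b), 'patches');
end

% 2) Train a plain auto-encoder batch by batch with a vectorized gradient step.
W1 = 0.01 * randn(hiddenSize, visibleSize);   b1 = zeros(hiddenSize, 1);
W2 = 0.01 * randn(visibleSize, hiddenSize);   b2 = zeros(visibleSize, 1);
alpha = 0.1;                                   % learning rate
sigmoid = @(z) 1 ./ (1 + exp(-z));

for epoch = 1:5
    for b = 1:numBatches
        S = load(sprintf('patch_batch_%02d.mat', b));
        x = S.patches;   m = size(x, 2);

        % vectorized feed-forward over the whole batch
        a2 = sigmoid(bsxfun(@plus, W1 * x, b1));
        a3 = sigmoid(bsxfun(@plus, W2 * a2, b2));

        % back-propagation of the squared reconstruction error
        d3 = (a3 - x) .* a3 .* (1 - a3);
        d2 = (W2' * d3) .* a2 .* (1 - a2);

        % gradient step on this batch only
        W2 = W2 - alpha * (d3 * a2') / m;   b2 = b2 - alpha * sum(d3, 2) / m;
        W1 = W1 - alpha * (d2 * x')  / m;   b1 = b1 - alpha * sum(d2, 2) / m;
    end
end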

    g) Experiments

After running the sparse auto-encoder on the patches sampled from STL-10 (100,000 patches of 8 x 8 pixels) I was able to produce feature activations visually very similar to the ones shown on the tutorial's web site.

Dataset   Number of categories   Method                                       Prediction Accuracy
STL-10    4                      Raw (random weights without pre-training)    ~58%
STL-10    4                      Sparse auto-encoder                          ~81%

The table above shows the effect of pre-training the weights with the unsupervised learning step: we can achieve much better classification performance if we support the supervised learning task by first running an unsupervised algorithm.

My experience with datasets other than STL-10 reinforced the validity of the self-taught learning method, even though I was never able to achieve a classification accuracy better than 85%.


Data set         Unlabeled set size     Labeled images   Categories                    Accuracy
www.flickr.com   2000                   200              4 (car, cat, tiger, tree)     85%
www.flickr.com   2000 (same as above)   400              2 (male face, female face)    72%
ImageNet         15000                  200              4 (car, cat, tiger, tree)     81%

The table above summarizes my experiments with custom datasets. I wanted to see whether the algorithm works well with images compiled by a user simply by downloading them at random. The algorithm differentiates significantly different objects well, for example cats from trees or trees from cars, but it has serious problems deciding between visually similar classes such as cats and tigers. I observed the same issue with the STL-10 dataset: the model could not reliably tell whether it saw a cat or another, non-cat mammal.
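One convenient way to make these mix-ups explicit is to accumulate the predictions into a confusion matrix. The sketch below reuses the pred and softmaxY vectors from the classification step and assumes the labels are the integers 1..numClasses; it is an illustration, not part of the thesis code:

numClasses = 4;                              % e.g. car, cat, tiger, tree
confMat = zeros(numClasses, numClasses);
for i = 1:numel(pred)
    confMat(softmaxY(i), pred(i)) = confMat(softmaxY(i), pred(i)) + 1;
end
disp(confMat);                               % row: true category, column: predicted category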

    4. Conclusion

Self-taught learning holds great promise for future research and for classification tasks where we do not have access to a plethora of labeled images, since it is much easier and cheaper to simply acquire large amounts of unlabeled images.

A further direction for my research will be to change the underlying architecture and its parameters. When I tried to use bigger datasets I was not able to show significantly better performance, so I suspect that one promising way forward is to use stacked auto-encoders. The idea is nicely presented on the UFLDL site [6].

My next step will therefore be to expand the algorithm with stacked auto-encoders. With only a single hidden layer one can only extract edges from the images, so the network effectively works as an edge detector. Inserting multiple hidden layers could extract deeper underlying features: deeper layers could detect longer contours, or perhaps simple "parts of objects", and an even deeper layer might then group these contours together or detect even more complex features [7].

[6] http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders
[7] http://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_Overview
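For reference, recent releases of the MathWorks Neural Network Toolbox ship ready-made building blocks for exactly this greedy, layer-wise construction. The sketch below only illustrates the idea with those built-ins (trainAutoencoder, encode, trainSoftmaxLayer, stack) on random placeholder data; it is not the hand-written implementation used in this thesis, and the layer sizes are arbitrary.

% Greedy layer-wise pre-training of two auto-encoder layers followed by a
% softmax layer and supervised fine-tuning, on random placeholder data.
X      = rand(64, 500);                    % 500 samples of 8 x 8 patches
labels = randi(4, 1, 500);                 % four categories
T      = full(ind2vec(labels));            % one-hot target matrix

autoenc1 = trainAutoencoder(X, 25, 'MaxEpochs', 100);
feat1    = encode(autoenc1, X);            % activations of the first hidden layer

autoenc2 = trainAutoencoder(feat1, 12, 'MaxEpochs', 100, 'ScaleData', false);
feat2    = encode(autoenc2, feat1);        % deeper, more abstract features

softnet  = trainSoftmaxLayer(feat2, T, 'MaxEpochs', 100);

deepnet  = stack(autoenc1, autoenc2, softnet);   % the full deep network
deepnet  = train(deepnet, X, T);                 % supervised fine-tuning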

    5. Acknowledgments

In this section I would like to express my gratitude to my supervisor, Bálint Antal, who guided me through my research and introduced me to the idea of self-taught learning. His help and guidance greatly determined the direction and the success of my research in this topic.

I would also like to thank Andrew Ng for his great lectures at Coursera, where I was first exposed to the ideas and techniques of machine learning in general. His Machine Learning course, freely available on YouTube, and his handouts greatly helped me learn the fundamentals of machine learning. His UFLDL tutorial on deep learning, together with its implementation tips, should be the standard of education at every institute.


    References

Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, Andrew Y. Ng, Self-taught Learning: Transfer Learning from Unlabeled Data, 2007

Bruno A. Olshausen, David J. Field, Sparse coding of sensory inputs, Current Opinion in Neurobiology, 14(4):481-487, 2004

Honglak Lee, Alexis Battle, Rajat Raina, Andrew Y. Ng, Efficient sparse coding algorithms, 2006

Yoshua Bengio, Aaron Courville, Pascal Vincent, Representation Learning: A Review and New Perspectives, 2012

Tom M. Mitchell, Machine Learning, March 1, 1997

Ethem Alpaydın, Introduction to Machine Learning, 2nd ed., The MIT Press, Cambridge, Massachusetts / London, England, 2010

G. E. Hinton, Supervised learning in multilayer neural networks, in The MIT Encyclopedia of the Cognitive Sciences, editors: Robert A. Wilson and Frank C. Keil, The MIT Press, 1999

D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating errors, Nature, 323:533-536, 1986

K. Nigam, A. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM, Machine Learning, 39:103-134, 2000

B. A. Olshausen, D. J. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature, 381:607-609, 1996

B. A. Olshausen, D. J. Field, Sparse coding with an overcomplete basis set: A strategy employed by V1?, Vision Research, 37:3311-3325, 1997

M. S. Lewicki, T. J. Sejnowski, Learning overcomplete representations, Neural Computation, 12(2), 2000

B. A. Olshausen, Sparse coding of time-varying natural images, Journal of Vision, 2(7):130, 2002

MathWorks Inc., Neural Network Product Help


    Sources from the Internet:

http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Neural_Networks (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Algorithm (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Self-Taught_Learning (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Linear_Decoders (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Feature_extraction_using_convolution (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Pooling (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders (accessed 5 November 2013)

http://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_Overview (accessed 5 November 2013)