
Igor Francisco Areias Amaral

Content-Based Image Retrieval for

Medical Applications

Faculdade de Ciências da Universidade do Porto

October 2010


Igor Francisco Areias Amaral

Content-Based Image Retrieval for

Medical Applications

Thesis submitted to the Faculdade de Ciências da Universidade do Porto to obtain the degree of Mestre em Engenharia Matemática (Master in Mathematical Engineering)

Dissertation carried out under the supervision of

Prof. Doutor Jaime dos Santos Cardoso (INESC-Porto)

and

Prof. Doutor Joaquim Fernando Pinto da Costa (DMA-FCUP)

Porto, October 2010


To my parents, José and Maria


Acknowledgments

It is finally done. These first words you read were the last to be written in this document. Looking back, this work reflects a hard learning process. Now, at the end, I feel that I know little. I learned that there is much more to learn. However, I was never alone in this task.

With the help of my thesis supervisors, Professors Jaime Santos Cardoso and Joaquim Pinto da Costa, I was able to acquire the necessary motivation and knowledge to achieve my goals. They provided the freedom to pursue my own ideas and, at the same time, were rigorous in reviewing my work. For that, and for granting me the opportunity to work in such a remarkable research field, I am thankful. They had a fundamental role in bringing this document to life.

While at INESC, where this work was developed, I also had the opportunity to meet amazing people and make new friends. They not only made many small contributions to this work, during our informal talks at lunch or coffee breaks, but are also responsible for the amazing atmosphere inside the institution.

My family was also very important to me during these last months, especially my parents, who were supportive on every occasion, giving me all the chances I had.

Lastly, Cristina, who taught me, during these last years, the importance of having someone waiting for you.

Igor Francisco Areias Amaral

October, 2010


Abstract

Advances in digital imaging technologies and the increasing prevalence of picture archival systems have led to an exponential growth in the number of images generated and stored in hospitals in recent years. Thus, automatic medical image annotation and categorization can be very useful for the purposes of image database management.

Conventional image retrieval systems are based on textual annotation, where key information about the image is stored. In medical images this forms an essential component of a patient's record. However, on many occasions this information is lost as a consequence of image compression or human error. Also, given the number of different standards adopted for medical image annotation, building a comprehensive ontology of medical terms is not always consensual. Recently, advances in Content-Based Image Retrieval have prompted researchers towards new approaches to information retrieval for image databases. In medical applications it has already met some degree of success in constrained problems.

This document addresses the problem of medical image annotation relying only on pictorial information, where images are classified by means of a hierarchical standard. We present a comprehensive survey of related work and a description of the mathematical tools used to achieve our goals. Our methodology consists in the use of approaches commonly applied to this problem, as well as the implementation of our own ideas, aiming to explore the hierarchical nature of the standard used for annotation. Afterwards, we improve our initial results by means of two merging strategies and provide an interpretation of our results.

Keywords: medical images, image descriptors, classification, support vector machines.


Resumo

Advances in digital image technology, together with the increased use of image archiving systems, have in recent years led to a growth in the number of images generated and stored in the hospital environment. As a consequence, the automatic annotation and categorization of medical images can be very useful for database maintenance.

Conventional approaches to image retrieval systems rely on textual annotations where crucial information about the image content is stored. However, this information is frequently lost as a consequence of image compression or human error. Additionally, given the number of different standards adopted for medical image annotation, the construction of an ontology comprising medical terms is not always consensual.

Recently, advances in content-based image retrieval have driven researchers towards new approaches to image retrieval in databases. In medical applications there is already relative success for specific problems.

This document addresses the problem of medical image annotation based only on visual information, where images are annotated according to a hierarchical standard. A summary of related work will be presented, together with a description of the mathematical tools used to reach the proposed goals. Our approach consists of methods generally used for the same problem, as well as the implementation of new strategies developed with the aim of exploring the hierarchy of the standard used for annotation. Later, through annotation fusion methods, we improve the initial results, followed by an interpretation of them.

Keywords: medical images, image descriptors, classification, support vector machines.


Contents

Introduction
1.1 Motivation
1.1.1 Concept-based systems
1.1.2 Medical image standards and ontologies
1.1.3 Concept-based retrieval limitations: the road to CBIR
1.2 Content-based image retrieval
1.2.1 CBIR Systems
1.2.2 Smeulders CBIR paradigm formalization
1.2.3 CBIR future work overview
1.2.4 CBIR in Medical Applications
1.3 Structure of this document
1.4 Goals
1.5 Main Contributions

Related Work
2.1 The IRMA code
2.2 Error evaluation for the IRMA code
2.3 ImageCLEF Medical Image Annotation Tasks
2.3.1 2005 Medical Annotation Task
2.3.2 2006 Medical Annotation Task
2.3.3 2007 Medical Image Annotation Task
2.3.4 2008 Medical Image Annotation Task
2.3.5 2009 Medical Image Annotation Task
2.4 Other IRMA database related work

Background Information
3.1 The image domain
3.1.1 Image properties
3.1.1.1 Color
3.1.1.2 Shape
3.1.1.3 Texture
3.1.1.4 Interest Points
3.1.2 Image descriptors
3.1.2.1 Tamura textures
3.1.2.2 Edge Histogram Descriptor (EHD)
3.1.2.3 Color Layout Descriptor (CLD)
3.1.2.4 Scalable Color Descriptor (SCD)
3.1.2.5 Color and Edge Directivity Descriptor (CEDD)
3.1.2.6 Fuzzy Color and Texture Histogram (FCTH)
3.1.2.7 Spatial Envelope (GIST)
3.1.2.8 Speeded Up Robust Features (SURF)
3.2 Support Vector Machine (SVM)
3.3 2007 Medical Annotation Task database

Methodology
4.1 Framework Description
4.1.1 Feature Extraction
4.1.1.1 Global Descriptors
4.1.1.2 Bag-of-words model
4.1.2 Model Training and Image Annotation
4.1.3 Methods Fusion

Results
5.1 Feature Extraction
5.2 Annotation
5.3 Semantic Meaningless Codes
5.4 Fusion

Conclusions and Future Work

References

Chapter 1

Introduction

1.1 Motivation

The image is probably one of the most important tools in medicine, since it provides a method for diagnosis, monitoring of drug treatment responses and disease management of patients, with the advantage of being a very fast, non-invasive procedure with very few side effects and an excellent cost-effectiveness ratio.

Hard-copy image formats, i.e., analog screen films, were the initial support for medical images, but they are becoming rarer. Maintenance, storage room and the amount of material needed to display images in this format contributed to their disuse. Nowadays digital images, the soft-copy format, lack the previously mentioned problems while offering the possibility of text annotations in metadata format. Table 1.1 gives an overview of digital types, sizes and numbers of images per exam in medical imaging. Curiously, the transition from hard-copy to soft-copy images is still the focus of an interesting debate related to human perception and interpretation issues during exam analysis [1].

Exam Type | One Image (bits) | # of Images/Exam | One Examination
Nuclear medicine (NM) | 128x128x12 | 30-60 | 1-2 MB
Magnetic resonance imaging (MRI) | 256x256x12 | 60-3000 | 8 MB up
Ultrasound (US)* | 512x512x8 | 20-240 | 5-60 MB
Digital subtraction angiography (DS) | 512x512x8 | 15-40 | 4-10 MB
Digital microscopy | 512x512x8 | 1 | 0.25 MB
Digital color microscopy | 512x512x24 | 1 | 0.75 MB
Color light images | 512x512x24 | 4-20 | 3-15 MB
Computed tomography (CT) | 512x512x12 | 40-3000 | 20 MB up
Computed/digital radiography (CR/DR) | 2048x2048x12 | 2 | 16 MB
Digitized X-rays | 2048x2048x12 | 2 | 16 MB
Digital mammography | 4000x5000x12 | 4 | 160 MB

*Doppler US with 24-bit color images

Table 1.1 – Types and sizes of some commonly used digital medical images (From [2]).
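As a rough sanity check on the figures in Table 1.1, the raw size of an exam follows directly from width x height x bit depth: bytes per image = W x H x bits / 8, times the number of images per exam. A minimal sketch in Python (illustrative only; real files add headers, and the 16 MB listed for CR/DR is consistent with 12-bit samples being stored in 16-bit words, an assumption on our part):

```python
def exam_size_mb(width, height, bits, n_images):
    """Raw exam size in MB: pixels x bit depth / 8 bits per byte, per image."""
    bytes_per_image = width * height * bits / 8
    return bytes_per_image * n_images / 2**20

# Computed/digital radiography from Table 1.1: 2048x2048x12, 2 images/exam.
print(exam_size_mb(2048, 2048, 12, 2))  # ~12.0 MB of raw 12-bit data
print(exam_size_mb(2048, 2048, 16, 2))  # 16.0 MB when stored in 16-bit words
```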

With the increase in data storage capacity and the development of digital imaging devices to improve efficiency and produce more accurate information, a steady growth in the number of medical images produced can easily be inferred. A good example of this trend is the Radiology Department of the University Hospital of Geneva, which alone went from producing 12,000 medical images a day in 2002 [3] to 50,000 a day in 2007 [4]. The main contributions to these numbers are video frames from cardiac catheterizations and endoscopies. Aside from the obvious usefulness of medical images for patient diagnosis and treatment, this huge amount of data also provides an excellent resource for researchers in the medical field.

1.1.1 Concept-based systems

With the exponential increase of medical data in digital libraries, it is becoming more and more difficult to perform certain analyses in search-related tasks. Because textual information retrieval is already a mature discipline, a way to overcome this problem is to use metadata for image indexing, where key descriptions of image content and context can be stored. For medical images we could store, for instance, patient identification, the type of exam and its technical details, or even a small text comment concerning clinically relevant information. With this information annotated, text-matching techniques can be applied for retrieving images satisfying a given search statement mediated by a thesaurus, performed by evaluating the similarity between the search statement and the metadata. Output evaluation can motivate a later thesaurus expansion, new rules for validation and matching, or a new search statement. This is called text-based or concept-based image retrieval. A schema for this type of system is depicted in Figure 1.1.

Figure 1.1 – A basic diagram representing a concept-based image retrieval system (From [5]).
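To make the mechanism concrete, the following sketch (hypothetical records and field names of our own; a real system would mediate terms through a thesaurus and a standard such as DICOM) matches a textual search statement against stored metadata:

```python
# Minimal sketch of concept-based retrieval over annotated image metadata.
# Records and field names are hypothetical, for illustration only.
records = [
    {"id": "img001", "exam": "x-ray", "region": "chest", "comment": "routine inspection"},
    {"id": "img002", "exam": "MRI", "region": "knee", "comment": "ligament follow-up"},
]

def search(statement, records):
    """Rank records by how many query terms occur in their metadata."""
    terms = statement.lower().split()
    scored = []
    for rec in records:
        text = " ".join(str(v) for v in rec.values()).lower()
        score = sum(term in text for term in terms)
        if score > 0:
            scored.append((score, rec["id"]))
    return [rid for _, rid in sorted(scored, reverse=True)]

print(search("chest x-ray", records))  # -> ['img001']
```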

Concept-based systems can be traced back, in a much wider domain, to the end of the 1970s according to Rui [6], and are still used in photo and video sharing websites like Flickr1, Google image search2 or YouTube3.

1.1.2 Medical image standards and ontologies

To foster the concept-based approach, a nomenclature of medical terms together with a relational or hierarchical model - a standard - is needed to bridge the content of the medical image and its context. Standards regarding image compression formats, database programming languages and network protocols are also essential, as they provide mutual understanding among users with different backgrounds in user-machine environments, as well as interchangeability of data via machine-machine protocols.

The ACR-NEMA standard for medical images was first developed in the 1980s by a joint venture between the American College of Radiology (ACR) and the National Electrical Manufacturers Association (NEMA). Later, in 1992, after the inclusion of network protocols and numerous glossary revisions, ACR-NEMA was renamed Digital Imaging and Communications in Medicine4 (DICOM) and is the most common standard used for specifying components of a medical imaging system. Other standards like SNOMED5, MeSH6, HL77, GALEN8, ICD-109 and UMLS10 were also developed, alongside other types of solutions that define interoperability between them: the IHE11 uses DICOM/HL7 for internal/external communications without being a standard itself. The "order entry" issues, related to specific information demanded by law that is only an optional part of the DICOM header, also led to the development of the Japanese JJ1017 standard [7]. In Japan the medical environment works with more detailed information, not fully covered by the DICOM standard. After a failure in trying to change the DICOM standard to suit these needs, Japan advanced with its own system as an extension of DICOM.

The degree to which the ontology of any standard can be a transparent representation of the content underlying medical images is questionable.

1 http://www.flickr.com

2 http://images.google.com

3 http://www.youtube.com

4 http://medical.nema.org

5 http://www.snomed.org

6 http://www.nlm.nih.gov/mesh/meshhome.html

7 http://www.hl7.org

8 http://www.opengalen.org

9 http://www.who.int/classifications/icd

10 http://www.nlm.nih.gov/research/umls

11 http://www.ihe.net


Understanding ontology as a formal way to codify semantics that are representative of a reason, we face the difficulty of choosing an adequate terminology that captures the meaning of the image. Very often the problem is reversed: such terminology is already well defined, but the concepts that we are trying to represent become the subject of attention [8]. This is particularly evident in Emotional Information Retrieval (EmIR) [9]. Furthermore, meaning is not a well-defined quantifiable attribute but, as Heidorn defines it [10], a property ascribed by human analysis of the image, bringing to bear a combination of objective and subjective knowledge in a sociocognitive process. Thus, on the one hand, words can be used to denote the image content if its meaning is straightforward and literal, which is not very usual. On the other hand, if the image content can be connoted with different layers of knowledge, then words are not enough to describe its meaning [8].

1.1.3 Concept-based retrieval limitations: the road to CBIR

In practice the conceptualization of a general thesaurus of medical terms consumes many resources and demands extensive collaboration efforts where consensus is hard to reach. It is reasonable to use inductive approaches by starting with more specific standards and attempting generalization later. In the composite SNOMED-DICOM microglossary [11] such a strategy is used. Nevertheless, the standards presented are not ineffectual, since they are used in several Picture Archiving and Communication Systems (PACS). Given the amount of images in a database, annotation by human hand can be a time-consuming and cumbersome task where perception subjectivity can lead to unrecoverable errors. A study of medical images using DICOM headers revealed 15% annotation errors of both human and machine origin [12]. The number of different languages that can be used for annotation is extensive and may lead to translation/interpretation errors during a search statement or when databases are merged. It is convenient to be aware of the prospect of re-indexing images due to the presence of an event that changes the importance of a particular aspect, e.g., Forsyth's previously unknown famous person photos [13], or the need to link the content of the image to a new search statement possibility, e.g., Seloff's engineer searching for a misaligned mounting bracket existent only in an annotated astronaut training image [14]. From the foregoing it is clear that concept-based image retrieval poses too many problems, both from the ontology point of view, as stated in the previous section, and from a practical point of view. Other major obstacles for concept-based image retrieval systems are the existence of homographs and the fact that the search statement, or query, does not allow the user to switch and/or combine interaction paradigms [15] during text transactions. The ideal system would relieve the human factor from the annotation task, by doing it automatically, and allow image retrieval by content in its purest form, not by text description. This is Content-Based Image Retrieval (CBIR).

1.2 Content-Based Image Retrieval

Throughout our lives, and from a very early age, we have the ability to easily recognize thousands of objects under many different conditions. Trying to understand how we do it is a deep and complex subject. The pre-iconographic, iconographic and iconological formalism proposed by Panofsky [16], generalized by Shatford [17] and extended by Shatford-Layne [18], provides the notion that an image is not a single unit but an amalgam of generic, specific and abstract content [16], where it may be necessary to determine which attributes will result in useful groupings of images and which attributes should be left for the user to identify [18]. The pre-iconographic elements imply a simple identification through familiarity and are representative of a very low level of knowledge related to human abstraction, but enough to comprehend some factual information within the image. Pure forms, like volumes and lines, and their disposition are at this level. The iconographic elements attempt to describe a motif or groups of motifs associated with the pre-iconographic level, and can imply a statistical procedure to identify those that are important or unimportant, depending on their role in the image. Iconological interpretation is the highest level of knowledge that can be extracted from the image and results from grouping the pre-iconographic and iconographic interpretations together with reasoning: it is the symbolic value of the image [16]. The experience of the individual plays a role at all stages of the formalism, exerting influence on the ability to group content based on the image attributes [17]. Shatford's generalization of Panofsky's work comprehends only the first two levels, pre-iconographic and iconographic, replacing them by of and about relational sentences. Therefore, an image can be of a generic/specific person, animal, thing, action, condition, place or time of day, etc., and about abstractions symbolized by objects or beings, actions, events, places or time [18]. While not rejected, real applications of such theoretical models were experimented with by Enser but met with little success due to the dichotomous character of queries made by users in a concept-based retrieval system [19].

We can attempt some simplification by considering a two-step process: first we retrieve information from what we see; second we categorize the scene and the objects within it using a previous cognitive process. If we define images as two-dimensional representations of our three-dimensional reality, then the same process holds. But what is the image content? There is no precise answer to this question. Nonetheless, relationships between image properties like color, shape, texture and interest points are certain to be fundamental for its characterization.

The goal of CBIR is to replicate this human ability of object recognition using a similar two-step process: the use of quantified measures from the image that are believed to represent color, shape, texture and interest points - the image descriptors - as an approach to human perception; and the use of machine learning techniques, to create a model for the data, or similarity measures, to interpret the image in order to establish the difference between two elements or groups of elements, as an approach to human cognition.

1.2.1 CBIR systems

In a typical CBIR system (Figure 1.2) the input from the user consists of one or more images, a test set. Pictorial content is then extracted into image descriptors and stored in the form of feature vectors. In the system there is a database of images, a training set, where the information extraction already took place and was used to choose the best models and/or similarity measures for comparison. With the help of these models and similarity measures, the test set is indexed and/or similar images are retrieved. Relevance feedback takes the results into consideration and acts by weighting or ranking feature vectors to discriminate their importance; deciding which image descriptors are relevant or not for the query; changing models and/or similarity measures; providing new model training and/or similarity measure definitions; or performing a new query. Human interaction can also be an integral part of a CBIR system at this stage, not only when automatic methods fail. From a user perspective, a CBIR system should meet, according to Chang [15], the general requirement of timely delivery and easy accessibility of image and associated information for the user, at a resolution appropriate for the intended task(s).

Figure 1.2 – A scheme of a typical CBIR system. Relevance feedback can be accomplished using human interaction (From [20]).
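A minimal sketch of this pipeline, assuming grayscale images represented as NumPy arrays and using a plain intensity histogram as a stand-in for the image descriptors discussed in Chapter 3:

```python
import numpy as np

def describe(image, bins=32):
    """Toy descriptor: normalized intensity histogram (stand-in for real descriptors)."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def retrieve(query_image, database, k=3):
    """Return the names of the k database images closest to the query in descriptor space."""
    q = describe(query_image)
    dists = sorted((np.linalg.norm(q - describe(img)), name) for name, img in database)
    return [name for _, name in dists[:k]]

# Usage with random stand-in images:
rng = np.random.default_rng(0)
database = [(f"img{i:02d}", rng.integers(0, 256, (64, 64))) for i in range(10)]
query = rng.integers(0, 256, (64, 64))
print(retrieve(query, database))
```

Relevance feedback would then act on this loop, for example by reweighting descriptor components or replacing the distance function according to which returned images the user marks as relevant.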

The first theoretical CBIR system designs appeared in 1987 [21]. The first CBIR system prototype appeared five years later, in 1992, and was developed by T. Kato [22] for an electronic art gallery containing 205 pictures of paintings. Kato is also credited as the first to use the term CBIR [23]. In his system, information extraction was performed by an adaptive filter, based on the Weber-Fechner law for the human vision mechanism, to capture global and local edge points. With this form of image abstraction, Query by Visual Example (QVE) algorithms, based on correlation between the image query and the database, were employed for the retrieval process. The best correlation was set as a similarity measure to match image candidates to the given query.

The first commercial release of a CBIR system, the IBM Query by Image Content (QBIC)

[24], took place in 1995 and swayed the nature of future frameworks. Surveys of CBIR systems

can be found in Aigrain [25], Eakins [23] and Rui [6].

Historically, CBIR is a relatively recent research area, but with numerous and diverse application fields, well summarized in Eakins [23]. Smeulders [26] points to the lack of computational capacity, of digital imaging devices and of a developed Internet as the main causes that hampered serious research attempts in this area before 1995. Aigrain [25] criticizes precisely the fact that at that time too much effort was being placed on information systems and not on content processing. However, it is consensual that the lack of communication between retrieval research and database systems contributed to a slower development of CBIR, perceived as early as 1979 at the Conference on Database Applications of Pictorial Applications [26] and remaining heretofore unrelated. Datta [27] also states that effective means of indexing were overshadowed by the research of efficient visual representations and similarity measures. This opinion slightly contradicts Rui [6], who states that the stimulus given by the introduction of the wavelet transform in the early 1990s had an impact on the growth of the number of available image descriptors and, consequently, motivated the appearance of CBIR systems. Other forms of image information retrieval for shape, color and texture, already established at this time, found extensive use in CBIR, leading in part to the first Moving Picture Experts Group (MPEG) standards in 1992. During this decade image descriptors also started to focus not on general information about the whole or partitioned image, but on interest points, aiming to capture higher-level information. This type of image descriptor was mainly influenced by the works of Harris on corner detection [28] and Lindeberg on blob detection [29]. One of the major achievements for this type of descriptor most probably arrived in 1999, when Lowe presented the Scale-Invariant Feature Transform (SIFT) [30], inspiring the research of other detectors invariant to affine transformations and certain image conditions [31]. The extent to which these interest points could be used changed when computer vision borrowed word frequency analysis from text-search operations in the so-called bag-of-words or bag-of-features models [32] around 2003, an approach that to this day is still gradually unfolding with the help of machine learning. Undoubtedly, one of the major contributions to CBIR was the Internet boom and the arrival of the first Internet browsers around 1995, which demanded urgent tools to retrieve information that suddenly could be accessed. Datta [27] verifies an exponential growth in the number of scientific papers made available by three main publishers, from around 150 in 1995 to 1200 in 2008.

1.2.2 Smeulders CBIR paradigm formalizationδ

Notwithstanding all the contributions in the previous section, a proper formalization of the whole CBIR paradigm is a necessity, as otherwise it would be hard to develop mission-critical applications or claim the much-needed level of consistency and integrity of a recently independent research field. Somehow this remained ignored by early implementations that did not demand any kind of full understanding of the field, thus overlooking a broader overview of the problems involved and a better refinement of CBIR system components. Only in 2000 did Smeulders present a deep review towards formalism for CBIR, one that has influenced researchers up to the present time and will continue to do so in the future.

When CBIR systems are used for image extraction or annotation, the output often does not satisfy the given query. Smeulders calls this problem the semantic gap, or the lack of coincidence between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation, and justifies this behavior with the difficulty of connecting high-level concepts associated with the image to low-level content in data-driven features. Considering that computers use numerical information only, when an image is converted into digital format it is important to be aware of how much information is lost during the process. This is the sensory gap, or the gap between the object in the world and the information in a numerical/verbal/categorical description derived from an image recording of that scene. Missing information can derive from cluttering, illumination conditions, occlusion, distortion, differences in camera viewpoint or any other accidental elements in the image.

The sensory gap is closely related to the variability of the image content, which Smeulders categorizes into two opposite domains: the narrow domain, if the image has a limited and predictable variability in all relevant aspects of its appearance, and the broad domain, if the image has an unlimited and unpredictable variability in its appearance even for the same semantic meaning. The distinction between image domains plays an important role during CBIR system design. Professional applications are usually domain-specific, dealing with narrow domain images for object recognition or a quantitative objective description of their content. Public applications use larger databases with broader domain images, towards generic applications for qualitative information retrieval.

δ In order to avoid excessive citations of the same source, any definitions in italics presented inside this section can be found in [26].


Formalism for the query was also formulated to capture the essence of the user's intention. If the user has no specific aim for the query, then a search by association is performed. Systems satisfying this requirement use iterative refinement of the given examples, thus being very interactive. If the search is made for objects belonging to a certain category, then we have a category search. Systems for category searches rely on similarity measures that characterize an image as part, or not, of a certain category. If the goal of the user is to search for a precise copy of an image, it is said that the user targets the search or performs an aimed search, where the system must search for images matching the specific example. Depending on the query intention, Datta [27] defines the user as a browser, a surfer or a searcher, respectively. With the image domain and user intention definitions, Smeulders reformulates the goal of CBIR systems in the following way: the challenge for image search engines on a broader domain is to tailor the engine to the narrow domain the user has in mind via specification, examples and interaction.

1.2.3 CBIR future work overview

There is room for development of CBIR in many directions. CBIR systems for narrow domain images have achieved a good degree of success; still, as the variability of images grows in larger datasets, the problem grows deeper. From what has been presented so far, it is possible to point out some possible trends:

• Concept-based systems for image retrieval should not be ignored. Even if somewhat independent from CBIR, both approaches can complement each other in hybrid systems where the integration of natural language and computer vision takes place. Aigrain [25] mentions that this can help to capture rich semantic content of the image, like names, places, actions or prices.

• A deeper understanding of what the user intends from the available information is also fundamental for any further work. For medical images, a study of what a doctor is looking for when examining an image for diagnosis can be found in [33].

• From a database perspective there is also a lot of work to be done, since the developments made so far in this field are currently ill-related to the developments in CBIR, being targeted more at an increase in capacity than at information organization for future retrieval. Meanwhile, a good interdisciplinary relationship for CBIR research is slowly being established between areas like machine learning, multimedia, computer vision, information retrieval and human-computer interaction, according to Datta in 2007 [27], a need previously expressed by Cawkell in 1993 [34] and Rui in 1999 [6].

• Today, digital image representations consist mostly of color models like the additive Red-Green-Blue (RGB), the Hue-Saturation-Value (HSV) or grayscale. The dependence on these color systems also raises the question of whether they are sufficient to provide information about the image. Specific color systems, like Tint-Saturation-Luminance (TSL) for face detection, are to be considered as a potential solution for specific problems in CBIR.

• A better interpretation of semantic image similarity is also needed for new metrics in similarity measures, since these degrade when databases grow and are usually domain-specific. Smeulders defends the search for similarity outside the scope of histogram similarity [26].

• To counter the semantic gap, the linkage of low-level visual features to high-level semantic meaning, more effort should also be placed on the research of additional descriptors for a better characterization of the image, thus allowing, at the same time, a decrease in the sensory gap. Image descriptors invariant to illumination conditions, distortions, clutter, occlusion, etc., would reduce data from a broader to a narrower domain, satisfying Smeulders' definition of the purpose of a CBIR system.

• New interface designs for user-machine interaction. Jain makes an original observation about this subject in his blog1, stating the 'simplistic' fear of developers to produce simple and useful systems rather than complex designs, and extending this observation to academics, criticizing their excess of jargon that obfuscates ideas.

• One of Smeulders' concluding remarks [26] points to the necessity of classifying usage types, aims and purposes in order to clearly evaluate whether a proposed system solves a particular problem or just performs better than a previous system.

• New, and more, general or domain-specific public databases.

Not all future work possibilities in CBIR are stated here, as they are very extensive, but only a general idea of growth directions. It is worth pointing out one last aspect of CBIR that may be an important future work area and is largely ignored in CBIR surveys.

1 http://ngs.ics.uci.edu, December 18th, 2009 entry.

As was already discussed, there was a lack of proper formalism in the CBIR paradigm until the valuable contribution of Smeulders. Withal, Smeulders never refers to any theoretical model of human cognition for vision, like the Panofsky/Shatford work, or to any implementation attempt of such models, like Enser's in his survey [35]. It seems that non-ambiguous, level-type theoretical models of knowledge liable for integration in CBIR stand above any formalism definitions and may possess a key role in understanding how high-level concepts can be constructed by grouping low-level content.

1.2.4 CBIR in medical applications

CBIR in the medical field also presents a growing trend in publications [36]. Although the number of experimental algorithms addressing specific problems and databases is growing, its reflection in the number of medical applications and frameworks is still very constrained. Only a few systems exist with relative success. The CervigramFinder system [37] was developed to study uterine cervix cancer. It is a computer-assisted framework where local features from a user-defined region in an image are computed and, using similarity measures, similar images are retrieved from a database. The Spine Pathology & Image Retrieval System (SPIRS) [38] is a web-based hybrid retrieval system, working with both image visual features and text-based information. It allows the user to extract spine x-ray images from a database by providing a sketch/image of the vertebral outline. The retrieval process is based on an active contours algorithm for shape discrimination. The Image Retrieval for Medical Applications (IRMA) system [39] is a generic web-based x-ray retrieval system. It allows the user to extract images from a database given an x-ray image query. Local features and similarity measures are used to compute the nearest images. The SPIRS and IRMA systems were merged to form the SPIRS-IRMA system, with the functionalities of both. More recently, a CBIR framework prototype was proposed for the retrieval of images from a broader domain, including x-rays, CT and US [40]. In this system multiple features, based on intensity, shape and texture, are extracted from a given query image and used to retrieve similar images based on similarity measures. Reviews of CBIR for medical applications can be found in [41] and [42]. A review of 21 CBIR systems for radiology can be found in [43].

Medical applications are one of the priority areas where CBIR can meet more success outside the experimental sphere, due to population aging in developed countries. Notwithstanding the progress already achieved in the few frameworks available, there is still a lot of work to be done in order to develop a commercial system able to fulfill image retrieval/diagnosis comprehending a broader image domain.

1.3 Structure of this document

In this chapter we presented the motivation for the problem, a small survey of the CBIR paradigm, its current state and future work. The rest of this thesis is structured as follows:

• In Chapter 2 we discuss the related work. We start by presenting the IRMA hierarchical code for the classification of medical images and the adopted error evaluation metric. Next, we survey the work from the ImageCLEF medical annotation tasks from 2005 to 2009, as well as other work related to the IRMA database1.

• Chapter 3 contains the background information for the comprehension of this work. Image structure and color, shape and texture properties will be addressed, together with a discussion of the global and local descriptors used herein for image retrieval. Moreover, we present the machine learning technique used, the Support Vector Machine (SVM), and the IRMA database subset used in this work.

• Chapter 4 contains the methodologies used, like the bag-of-visual-words model, classification strategies and decision fusion schemes.

• In Chapter 5 we present the results together with a discussion of them. In Chapter 6 the major conclusions and future work will be drawn.

1.4 Goals

The problem proposed in this work consists in medical image classification/annotation: given a medical image, we want to know what it is, i.e., what its class is, taking into consideration a database of collected images belonging to specific classes.

Some general goals were defined at the beginning and were adjusted depending on the intermediate results achieved. An emphasis on learning subjects related to this work was also an essential part of the initial goals. These were:

• To comprehend the fundamental aspects involved in CBIR, namely in the areas of image processing, focusing on image descriptors, and machine learning, namely Support Vector Machines (SVMs).

• To study the work done in recent years in CBIR for the medical image domain, especially the work related to the IRMA database.

1 The IRMA database and all images presented in this work are a courtesy of T.M. Deserno, Department of Medical Informatics, RWTH Aachen, Germany.

• Based on the two previous points, to implement one of the best approaches for the IRMA database available within the institution. This goal was adjusted very early: instead of an implementation, we decided to design our own system, although some aspects of previous related works were preserved.

• To use well-known image descriptors, as well as other image descriptors not used in previous related works, focusing on those with code provided by feature extraction engines or by their authors.

• To investigate new classification strategies and compare them with the related works.

• To investigate fusion methods to improve our initial results, in order to make them competitive with or better than the results found in the literature.

• To point out new directions and considerations where future work can be developed.

In order to achieve these goals we followed a set of fundamental principles of the computer vision/machine learning group within the Instituto Nacional de Engenharia de Sistemas e Computadores (INESC-Porto)1, where this work was developed:

• Images were considered in their raw format.

• Results were reported on a quantitative basis in order to allow comparison with related works.

• Ongoing progress and preliminary results were presented in regular meetings within the institution, in order to gather feedback from researchers who work on similar problems and whose contributions proved valuable.

1.5 Main contributions

The main contributions of this work are:

• An experimental performance evaluation of image descriptors in the context of medical image annotation in medical databases.

• A new interpretable method of classification using the SVM, based on the adopted hierarchical standard for medical images, and its comparison with other methods used in related works.

• An experimental evaluation of the fusion between methods.

• A relearning method using SVMs to identify potentially misclassified images, which are then subjected to the fusion process stated in the previous point.

1 http://www2.inescporto.pt/


Results from this work led to the following publication:

• Igor F. Amaral, Filipe Coelho, Joaquim F. Pinto da Costa and Jaime S. Cardoso;

“Hierarchical Medical Image Annotation Using SVM-based Approaches”, in

Proceedings of the 10th IEEE International Conference on Information Technology and

Applications in Biomedicine, 2010.

Chapter 2

Related work

2.1 The IRMA code

The IRMA code for medical image classification [44] is a mono-hierarchical, multi-axial classification scheme for medical images. It consists of four axes, with three to four positions each, describing different content within the image: the technical (T) axis code, describing the image modality; the direction (D) axis code, describing body orientation; the anatomical (A) axis code, for the body region examined; and the biological (B) axis code, for the examined body system. All axes have three positions, with the exception of the T axis, which has four. Therefore the full IRMA code for one particular image consists of 13 characters (IRMA: TTTT-DDD-AAA-BBB). In order to emphasize the mono-hierarchical order of the positions, we adopted a slightly different notation for the IRMA code – IRMA: T1T2T3T4-D1D2D3-A1A2A3-B1B2B3. This means, for example, that the position T2 in the modality axis is hierarchically higher than the position T3 within the same axis. This notation will prove to be useful in subsequent chapters.

The possible values for a particular position are {0,…,9,a,…,z}, where '0' in a particular position of an axis denotes 'unspecified', truncating the code and forcing the assignment of the same value to any hierarchically inferior position. Each sub-position in an axis, i.e. a position that is not the hierarchically highest (the root), is connected with one and only one hierarchically higher position. Therefore any axis consists of a tree whose leaves are reached by one and only one top-down path, making it, as previously stated, mono-hierarchical. An IRMA code will consequently be a forest of trees, each tree representing an axis. Only two relational sentences are allowed: "is a" for the root and "part-of" for sub-positions. Even if the meanings of two or more sub-positions at different hierarchical levels are literally identical, as in some T3 and T4 sub-positions for the sonography modality (T1 = '2', T2 ∈ {1,…,8}), different meanings of the axis are established depending on the value of the hierarchically higher sub-position of which they are "part-of", guaranteeing non-ambiguity. Such a structure allows the development of methods for semantic queries in databases. Figure 2.1 shows some examples of images with their respective IRMA codes.

Figure 2.1 – Examples of x-ray images annotated with the IRMA code. Notice that some axes may be completely 'unspecified' (From: 2007 ImageCLEF Medical Annotation Task database). The four examples shown are:

• IRMA: 1123-127-500-000 – x-ray, projection radiography, analog, high energy – coronal, anteroposterior, supine – chest – unspecified

• IRMA: 1123-211-500-000 – x-ray, projection radiography, analog, high energy – sagittal, right-left lateral, inspiration – chest – unspecified

• IRMA: 1121-230-942-700 – x-ray, projection radiography, analog, overview image – sagittal, mediolateral – lower extremity/leg, knee, left knee – musculoskeletal system

• IRMA: 1121-220-230-700 – x-ray, projection radiography, analog, overview image – sagittal, left-right lateral – cranium, neuro cranium – musculoskeletal system


The technical (T) axis, with four positions, describes the image modality by assigning the image acquisition source to the T1 position, whose details are then assigned to the T2 position. The T3 position specifies the technique used, with more details on such techniques specified in the T4 position. The directional (D) axis starts with the description of the common orientation of the body in the D1 position, whose details are specified in the D2 position. Here there was a concern to distinguish the posteroanterior and anteroposterior directions, due to the differences in scale between organs or bones. The last position of the directional axis, D3, describes the functional orientation during the exam. The anatomy of the human body is described in the anatomy (A) axis. The first position of this axis, A1, specifies nine major regions, and the subsequent positions, A2 and A3, define these regions in more detail. The B axis denotes the organ system under analysis and complements the A axis, because different types of organs exist in the same anatomical region. The B1 position in this axis specifies 10 organ systems, and the remaining positions, B2 and B3, specify a particular system until an organ is identified.
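To make the structure concrete, the sketch below (helper functions of our own, not part of the IRMA distribution) splits a full 13-character code into its four axes and applies the '0' truncation rule described above:

```python
AXES = {"T": 4, "D": 3, "A": 3, "B": 3}  # axis name -> number of positions

def parse_irma(code):
    """Split e.g. '1121-230-942-700' into its four axes."""
    parts = code.split("-")
    assert [len(p) for p in parts] == list(AXES.values()), "expected TTTT-DDD-AAA-BBB"
    return dict(zip(AXES, parts))

def truncate(axis_code):
    """'0' means 'unspecified': every hierarchically inferior position becomes '0' too."""
    out = ""
    for ch in axis_code:
        out += "0" if out.endswith("0") else ch
    return out

print(parse_irma("1121-230-942-700"))  # {'T': '1121', 'D': '230', 'A': '942', 'B': '700'}
print(truncate("502"))                 # '500': positions below an unspecified node
```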

Due to the structure of the IRMA code, modifications or extensions can easily take place by replacing or adding new positions/position values, or even by adding completely new axes. Other standards with the same purpose suffer from incompleteness, ambiguity, lack of causality and lack of hierarchy. The MeSH thesaurus is a polyhierarchical standard where several codes for modality possess the same meaning as a unique IRMA T-axis code, and it suffers from incompleteness. The DICOM and SNOMED nomenclatures are incomplete and ambiguous for the description of body anatomy. The JJ1017 standard is closely related to the IRMA code, but offers only three axes for image classification, raising problems for semantic retrieval due to ambiguity and lack of detail for the human body regions. Specific examples of the limitations of these standards in comparison with the IRMA code can be found in [44]. A complete reference for the IRMA code values is available on request at the IRMA project website1.

2.2 Error evaluation for the IRMA code

Consider a particular axis X of an IRMA code. Let $l = l_1, \ldots, l_i, \ldots, l_I$ be the correct code for X. Here $l_i$ is a particular position in X and $I$ is the depth of the tree for the considered axis. Notice that $I$ may change for different axes. Let $\hat{l} = \hat{l}_1, \ldots, \hat{l}_i, \ldots, \hat{l}_I$ be a classified code for X, where each position $\hat{l}_i$ can take any value valid for that position or, if a "do not know" decision is chosen, the wildcard '*'. If a position $\hat{l}_i$ in X was wrongly classified, then all positions $\hat{l}_{i+1}, \ldots, \hat{l}_I$ will also be considered wrong, due to the hierarchical structure of the IRMA code. If for some position a '0' or '*' classification is given, then, again, all subsequent hierarchically inferior positions will be '0' or '*', respectively.

1 http://www.irma-project.org


The error corresponding to a particular axis is given by:

$$\sum_{i=1}^{I} \underbrace{\frac{1}{b_i}}_{(a)} \, \underbrace{\frac{1}{i}}_{(b)} \, \underbrace{\delta(l_i, \hat{l}_i)}_{(c)} \qquad (2.1)$$

with

$$\delta(l_i, \hat{l}_i) = \begin{cases} 0 & \text{if } l_j = \hat{l}_j \;\; \forall\, j \leq i \\ 0.5 & \text{if } \hat{l}_j = * \;\; \exists\, j \leq i \\ 1 & \text{if } l_j \neq \hat{l}_j \;\; \exists\, j \leq i \end{cases} \qquad (2.2)$$

where in (2.1):

• (a) is the branching factor, which accounts for the decision difficulty at the specific position, with $b_i$ the number of possible values for that position;

• (b) is the position in the axis code string and accounts for the level in the hierarchy;

• (c) is the weight given to a correct / not known / incorrect decision.

If an error is found at a position $i$, then $\delta(l_j, \hat{l}_j) = 1$ for every $j \in \{i+1, \ldots, I\}$.

The normalized error is given by:

$$\frac{\sum_{i=1}^{I} \frac{1}{b_i}\,\frac{1}{i}\,\delta(l_i, \hat{l}_i)}{\sum_{i=1}^{I} \frac{1}{b_i}\,\frac{1}{i}} \qquad (2.3)$$

The normalized error values for an axis range between 0, for a completely correct classification, and 1, for a completely misclassified axis. The contribution of this error to the total error of a complete IRMA code is weighted by the number of axes. Therefore, because we are considering a multi-axial scheme, the error count for each axis is obtained by multiplying (2.3) by $1/k$, where $k$ is the number of axes in the IRMA code. In our case $1/k$ is 0.25, because four axes are considered.


In Table 2.1 an example of the error count for the anatomical (A) IRMA code axis is presented. For the specific code, the $b_i$ value1 in (2.1) is 11 for the A1 position, 7 for the A2 position and 8 for the A3 position. The maximum error for a completely misclassified axis is therefore 0.20400, obtained by setting the (c) term in (2.1) to 1. The weight of the error for each of the four axes in the IRMA code is 0.25. Multiplying this value by the normalized error gives us the error count, which is the contribution to the total error of a complete IRMA code.

Correct code: 463

Classified | Error (eq. 2.1) | Normalized Error (eq. 2.3) | Error Count
463 | 0 | 0 | 0.000000
46* | 0.020833 | 0.102122 | 0.025531
461 | 0.04166 | 0.204244 | 0.051061
4*1 | 0.05655 | 0.277188 | 0.069297
4** | 0.05655 | 0.277188 | 0.069297
47* | 0.11310 | 0.554377 | 0.138594
473 | 0.11310 | 0.554377 | 0.138594
477 | 0.11310 | 0.554377 | 0.138594
*** | 0.10200 | 0.5 | 0.125000
731 | 0.20400 | 1 | 0.250000

Table 2.1 – Example of error counts for the anatomical (A) IRMA code axis (From [45]).

Two special situations not depicted in Table 2.1 should also be considered: if the true code of the axis is completely unspecified, i.e. '000', assigning wildcards to all positions, '***', is not considered an error; if the true code of the axis is again completely unspecified and a wildcard is assigned to the first position while other values, even if correct, are assigned to the subsequent positions, as in '*00', then the classified result will have an error according to (2.1).

The goal of this error counting scheme is to penalize wrong decisions that are supposed to be easy, i.e., at positions that are high in the hierarchy or that have few choices for that particular node, more heavily than wrong decisions made at hierarchically deeper nodes or at nodes with a wide range of possibilities.

1 According to the IRMA code for the 2007 Medical Annotation Task. Changes in the IRMA code since
that year influence the penalization for a wrong decision, with consequences for the error count.


2.3 ImageCLEF Medical Image Annotation Tasks

Evaluation campaigns for information retrieval, object classification, detection and
segmentation, machine translation, video tracking and even speech recognition are increasingly
adopted as a way to research new methods or to improve existing ones. The competitive
character of such campaigns, where several teams aim for the best overall results, is useful to
benchmark the proposed systems. The idea behind the campaign concept is this: a database is
provided for a task in one of the designated areas. If the results are satisfactory then the amount
of data provided in the next evaluation campaign increases, raising the complexity of the
problem, or a new database is provided for a different task. This data reusability allows
researchers to learn from their previous experiences and refine their future work. Examples of
such evolutionary campaigns are the Text REtrieval Conference (TREC)1 for information
retrieval, TRECVID2, a part of TREC dedicated to video retrieval since 2001, and the
PASCAL3 network, with image segmentation, object detection and classification challenges in
2005 and 2006.

The evolution of the state of the art in automatic annotation methods for medical images
based on purely visual features can be tracked through the medical image annotation tasks of
the CLEF4 cross-language image retrieval campaign (ImageCLEF), which has taken place since
2003 for information extraction from digital image libraries; the medical image annotation
tasks ran from 2005 to 2009, and from 2010 onwards this task will not take place. Like the
other tasks that are part of the ImageCLEF campaign, the medical image annotation task also
assumed a competition format. The goal of the tasks, however, was to explore, develop and
promote automatic annotation techniques and strategies for semantic information extraction in
medical image databases with little or no annotation.

The IRMA database was used incrementally throughout the ImageCLEF Medical Annotation
Tasks. This database consists of approximately 17000 radiographs collected from the
Department of Diagnostic Radiology at RWTH Aachen University, Aachen, Germany5 during
daily routine examinations. A consequence of this routine is the uneven distribution of the types
of radiographs gathered; in the database, however, each class has a minimum of 10 images. A
distinctive characteristic of this database is that many images show strong intra-category
variability, i.e., very distinctive images sharing the same code, and inter-category similarity,
i.e., very similar images possessing different codes (Figures 2.2 and 2.3).

1 http://trec.nist.gov

2 http://www-nlpir.nist.gov/projects/trecvid/

3 http://www.pascal-network.org

4 http://www.clef-campaign.org

5 http://www.rad.rwth-aachen.de


Figure 2.2 – Example of X-rays belonging to the IRMA database with high intra-category
variability. All images share the same IRMA code 1121-120-800-700 (From: [46]).

Figure 2.3 – Example of X-rays belonging to the IRMA database with high inter-category
similarity. Top row images are from the elbow, with an IRMA code xxxx-xxx-44x-xxx, and
bottom row images are from the knee, with an IRMA code xxxx-xxx-94x-xxx (From: [46]).

All images were stored using gray level values in the Portable Network Graphics (PNG)
format. Furthermore, images were scaled proportionally to their original size, keeping the aspect
ratio, in order to fit a 512x512 maximum pixel window. All images were annotated by
radiologists using the IRMA code, but this was disregarded during the 2005 and 2006 tasks,
where all images were annotated simply by a class number. Only in 2007 was the IRMA code
introduced for classification purposes.

2.3.1 2005 Medical Annotation Task

The 2005 Medical Annotation Task [47] provided a subset of 10000 images from the IRMA
database. Of these 10000 images a random set of 9000 was selected as training data, published
with the annotation, and the remaining 1000 images were given for evaluation, without any
annotation. Images belonged to 57 distinct classes and no IRMA code was used for annotation.
Thus image classes were labeled with integers ranging from 1 to 57. Error evaluation was
performed by considering the total error rate, i.e., the percentage of wrongly classified images,
and not the error evaluation scheme presented in section 2.2. A total of 12 teams participated,
submitting 41 runs1. Table 2.2 gives an overview of the results for the 2005 Medical
Annotation Task.

Rank    Team           Error Rate (%)    Difference (%)
1       RWTH-i6        12.6
2       RWTH-mi        13.3              0.7
3       Ulg.ac.be      14.1              1.5
4       Geneva-gift    20.6              8.0
5       Infocomm       20.6              8.0
6       MIRACLE        21.4              8.8
7       NTU            21.7              9.1
8       NCTU-DBLAB     24.7              12.1
9       CEA            36.9              24.3
10      Mtholyoke      37.8              25.2
11      CINDI          45.3              30.7
12      Montreal       55.7              43.1

Table 2.2 – Results overview from the 2005 ImageCLEF Medical Annotation Task (From: [47]).

RWTH-i6 – Computer Science Department, RWTH University, Aachen, Germany:

The Computer Science Department from RWTH University was the winning team with a
total of 12.6% misclassified images. The method consisted in the Image Distortion Model
(IDM) using image thumbnails of size X × 32. From the thumbnails, vertical and horizontal
gradients (obtained by applying a Sobel filter), Tamura texture features and (3 × 3, 5 × 5)
subimages were extracted. All features were extracted with the Flexible Image Retrieval
Engine2 (FIRE) CBIR framework. The IDM is related to the field of image registration through
its inherent optimization, or matching, process and aims to compensate only those deformations
that leave the image class unchanged, placing no emphasis on discrimination between classes.
This is not the objective of other methods in the same area, which focus on best matches to
distinguish images of two classes.

1 Sometimes the number of submissions from one team can be very high. The results and discussion
presented for each task and each team will focus only on the method that performed best.

2 http://thomas.deselaers.de/fire/


Classification was made using a Nearest Neighbor (1-NN) classifier. Every image was
mapped onto a reference image (from the training set) by summing the local pixel distances,
which were squared Euclidean distances. The distance between images was calculated by
minimizing the cost over all possible deformation mappings. For this, a subset of 1000 images
from the training data was randomly selected and the best parameters for the model were
chosen. From experiments on this set it was found that X × 32 image thumbnail features had the
best performance. This performance held for the test set after evaluation. Other, slightly
different runs from the RWTH-i6 team also performed very well in this task.
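As a rough illustration only (the actual model matches local gradient contexts, not single pixel values, and uses additional features), an IDM-like deformation-tolerant distance inside a 1-NN classifier could be sketched as follows:

import numpy as np

def idm_distance(query, ref, w=2):
    """Sum, over all query pixels, of the minimum squared gray value
    difference found inside a (2w+1)x(2w+1) displacement window of the
    reference thumbnail: deformations within the window cost nothing."""
    h, wd = query.shape
    total = 0.0
    for y in range(h):
        for x in range(wd):
            win = ref[max(0, y - w):y + w + 1, max(0, x - w):x + w + 1]
            total += ((win - query[y, x]) ** 2).min()
    return total

def classify_1nn(query, train_images, train_labels):
    """Brute-force 1-NN under the deformation-tolerant distance."""
    dists = [idm_distance(query, r) for r in train_images]
    return train_labels[int(np.argmin(dists))]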

RWTH-mi – Department of Medical Informatics, RWTH University, Aachen, Germany:

The Department of Medical Informatics from RWTH University had an error rate of 13.3%,
reaching second place in the task. An IDM using the Cross Correlation Function (CCF)
for 32 × 32 image thumbnails and a 1-NN Euclidean distance classifier was combined with an
IDM using Tamura texture features with the Jensen-Shannon divergence as the distance metric
for histogram comparison. Because the IDM is time consuming, performance was boosted by
passing only the 500 nearest neighbors to the CCF-IDM. The best parameters for the weight of
each IDM in the model combination were empirically evaluated using a subset of 1000 images
from the training set.

Ulg.ac.be – Institut Montefiore, University of Liège, Liège, Belgium:

The team from the University of Liège scored an error rate of 14.1% using 16 × 16 patches
randomly extracted from the images. Patch extraction was performed inversely proportional to
the number of examples of each class in the training set. More precisely, $N_p = N / (m \cdot O_c)$,
where $N_p$ is the number of patches per image, N the total number of patches, m the number
of classes and $O_c$ the number of images of class c. A fixed total of N = 800000 patches was
extracted for the training set, giving approximately 14000 patches per class. For the test set this
number was fixed at 500 patches per image. Image contrast enhancement, with ImageMagick1,
was applied to every patch. For the learning phase 25 decision trees with boosting (Tree
Boosting) were used together with a stop-splitting criterion using a G2 statistic to determine the
significance of the test. A G2 statistic can be seen as an alternative to the chi-squared
goodness-of-fit test and is mostly used in hierarchical models of data. All patches of a test
image were then aggregated according to their classified class and the final classification was
reached through majority voting over the patches, image by image.

1 http://www.imagemagick.org/


Geneva-gift – University and University Hospitals of Geneva, Service of Medical
Informatics, Geneva, Switzerland:

The team from Geneva had an error rate of 20.6% using an adaptation of the GNU Image
Finding Tool1 (GIFT) for medical images, medGIFT2, for image feature retrieval. medGIFT
offers a number of image descriptors, like color histograms and Gabor Texture Filters (GTF),
considering multiple scales, gray levels and directions. The best results relied on 8 gray levels
and 8 directions for the GTF. Features were extracted from the training set and term
frequency-inverse document frequency (tf/idf) weighting was performed. The same procedure
took place for the test set and a query was made for every image. Query results consisted of the
5-NN using histogram intersection for similarity scoring and comparison. The class with the
highest final score became the class selected for the image.

Infocomm – Institute for Infocomm Research (I2R), Singapore:

The Infocomm team achieved the same score as Geneva-gift, with a 20.6% error rate. For
this task various image features were extracted, among them polarity, anisotropy and contrast
for texture, and Low Resolution Pixel Maps (LRPM) from 16 × 16 thumbnails of the images for
spatial layout. Several subsets of the training data were used to train an SVM with a Radial
Basis Function (RBF) kernel. Once the best parameters for cost and gamma were chosen the
test set was classified. This group also combined the previous descriptors with Blob features,
with better results, but never submitted this run. Other techniques like Principal Component
Analysis (PCA) were also attempted but with less success.

MIRACLE – Universidad Politecnica de Madrid, Universidad Carlos III de Madrid,
DAEDALUS S.A., Madrid, Spain:

The MIRACLE team scored a 21.4% error rate in this task. Images were reduced to 32 × 32
thumbnails and GIFT was used for image retrieval given a test image as query. Classification
was performed via a decision table: for the 20-NN a weighting function was applied to the
relevance of each result. Weights corresponding to the same class were summed, as a measure
of confidence, and the class corresponding to the highest sum was assigned to the query image.
The best number of closest retrievals was optimized using 10-fold cross-validation on the
training set. Other strategies from this team using a NN classifier achieved worse performance.

1 http://www.gnu.org/software/gift/

2 http://www.sim.hcuge.ch/medgift/


NTU – National Taiwan University, Taipei, Taiwan:

The NTU team scored an error rate of 21.7% using two methods. Images were resized to
256 × 256 pixels and divided into a 32 × 32 grid of blocks, each with 8 × 8 pixels. The average
gray value of each block was computed, giving a 1024-element feature vector. Similarity
between test images and classes was measured using the cosine metric in a 1-NN classifier and
a 2-NN classifier. No learning phase seems to have taken place for this method.
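A possible sketch of this block-average descriptor with a cosine-similarity 1-NN follows; the resizing step is assumed to have been done already and the names are illustrative:

import numpy as np

def block_average_features(img, block=8):
    """Mean gray value of each 8x8 block of a 256x256 image: 1024 values."""
    h, w = img.shape
    blocks = img.reshape(h // block, block, w // block, block)
    return blocks.mean(axis=(1, 3)).ravel()

def cosine_1nn(query_vec, train_vecs, train_labels):
    """Nearest neighbor under the cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    return train_labels[int(np.argmax(t @ q))]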

NCTU-DBLAB – Department of Computer & Information Science, National Chiao
Tung University, Hsinchu, Taiwan:

NCTU-DBLAB scored an error rate of 24.7%. The method presented involved scaling the
images to an 8 × 8 pixel size; the corresponding 64-dimensional pixel gray level vector was
used as a spatial feature to feed an SVM classifier with an RBF kernel. Other image feature
combinations and SVM kernels were used without better success. No indication of a training
phase for kernel parameterization was found.

CEA – CEA-LIST/LIC2M, Fontenay-aux-Roses, France:

The CEA team submitted three runs, with the best scoring an error rate of 36.9%. All images
were resized to 100 × 100 pixels and divided into four blocks of 50 × 50 pixels. A Sobel filter
was applied to each block and the pixels within were projected onto the vertical and horizontal
axes, originating an aggregation histogram. The test image class was attributed by majority
voting of the 3-NN using a Euclidean metric for histogram comparison.

Mtholyoke – Mt. Holyoke College, South Hadley, Massachusetts, USA:

The method used by Mount Holyoke College started by scaling images from both sets to
256 × 256 pixels. Each image was then divided into a 5 × 5 square grid; a total of 250000
blocks for all images was built. Because Gabor energy measures and Tamura texture features
are not correlated, these features were extracted from all blocks with the help of the FIRE
framework. Coarseness was separated from the Tamura textures and used as a separate feature.
Two clustering methods, Cluster Query Likelihood (CQL) and Cluster Based Document Model
(CBDM), were used. Ranking measures using the error rate and a K-NN clustering showed that
Gabor energy performed best as image descriptor with the CQL model, even better than a
combination with Tamura texture features. The best model parameters were optimized with
10-fold cross-validation and different values of K for each set of features. These models were
constructed with the 9000-image training set. Several clustering methods for the proposed
models, K-means and K-NN, were experimented with, with the latter producing better results.
K=25 for the CQL model and K=50 for the CBDM model were empirically established and the
models were submitted for evaluation on the test set. It is not clear in the literature which of
these models performed best, but they had similar error rates of 37.8% and 40.3%.

CINDI – CINDI group, Concordia University, Montreal, Canada:

The CINDI team submitted only one run, with an error score of 43.3%. Several image
features were extracted to build a vector of approximately 200 elements: invariant moments and
the Canny edge detector for shape; gray level co-occurrence matrices (GLCM) for texture and,
from these, higher level features like entropy, energy, contrast and homogeneity. Training set
feature vectors were used as input to an SVM with an RBF kernel. The best parameters for the
SVM kernel, found using 10-fold cross-validation, were (C, γ) = (200, 0.01), giving a 54.65%
rate of correct classifications; the performance attained on the test set was only 2.05% better
than this estimate.

Montreal – Montreal University, Montreal, Canada:

The Montreal team achieved the worst performance in the task, with 55.7% of images
misclassified. Aside from the fact that a combination of Fourier shape and contour descriptors
together with texture coarseness and directionality was used for image feature extraction,
nothing about the annotation method is known because no working notes were ever published
by this team.

Discussion for the 2005 Medical Annotation Task:

The first Medical Annotation Task gathered a good number of participants with submissions.
However, 14 of the teams registered for the task did not participate. This may be related to an
interest in the data provided rather than in presenting any methodology for the task. Such
behavior would be a constant in every edition of this task.

Looking at Table 2.2, it is visible that the best results of each team can be grouped into three
categories: below 15%, with 3 teams; from 20% to 25%, with 5 teams; and from 35% to 56%,
with 4 teams. The RWTH University teams, partially because they were more familiar with the
data, achieved the best performances with the IDM. Pixel intensity values from scaled images
outperform general image features, but only slightly against object recognition methods using
patches as a visual word representation. Classifier performance against IDM and k-NN is
mixed, as it is spread all over the rankings, with good and less good results. Some teams used
existing frameworks for feature retrieval, in what could be a sign of the necessity of such
systems for rapid experimentation. Fusion of the best results was attempted but no score
improvement could be achieved.


2.3.2 2006 Medical Annotation Task

The subset for the 2006 Medical Annotation Task [48] increased to 11000 images, 10000
annotated for training and 1000 without any annotation for evaluation, from 116 different
classes. No IRMA code was used for annotation and the error evaluation was made by taking
the error rate into consideration, as in 2005. In this task 28 runs were submitted by 12 teams;
another 14 teams did not submit any runs. Table 2.3 shows the score ranking according to the
best run of each team.

Rank    Team             Error Rate (%)    Difference (%)
1       RWTH-i6          16.2
2       UFR              16.7              0.5
3       MedIC-CismEF     17.2              1.0
4       MSRA             17.6              1.4
5       RWTH-mi          21.5              5.3
6       CINDI            24.1              7.9
7       OHSU             26.3              10.1
8       NCTU             26.7              10.5
9       MU               28.0              11.8
10      ULG              29.0              12.8
11      DEU              29.5              13.3
12      UTD              31.7              15.5

Table 2.3 – Best run per team for the 2006 ImageCLEF Medical Annotation Task (From: [48]).

RWTH-i6 – Computer Science Department, RWTH University, Aachen, Germany:

The Computer Science Department from RWTH University managed, once again, to achieve
the overall best performance in this task, with an error rate of 16.2%. The best run used sparse
histograms of image patches, and a discriminative log-linear maximum entropy model was used
for classification. Square patches with edge lengths of 7, 11, 21 and 31 pixels were extracted at
every position of the images, allowing coverage of objects of different sizes and providing
some degree of invariance to scale changes. Patch dimensionality was reduced to between 6
and 8 features with the use of Principal Component Analysis (PCA).

The histogram grid was built with uniformly distributed bins using the mean and variance of
the dimensionally reduced patches. For every image a histogram with 65536 bins was built.
The position of each patch was added to the histogram as a way to add spatial information and,
consequently, achieve invariance to translations. Log-linear maximum entropy models optimize
the class posterior probability by discriminative training. Model parameters were optimized
with a modified generalization of the iterative scaling algorithm and classification followed
Bayes' decision rule. This team also submitted the run used in the 2005 Medical Annotation
Task, but this method underperformed. Another run using SVMs with histograms of image
patches was also presented, with results very similar to the best run.

UFR – LMB Group, Albert-Ludwigs-University, Freiburg, Germany:

The Albert-Ludwigs-University team's best error rate was 16.7%, not very far from the best
overall score. Using a wavelet-based interest point detector, relational feature vectors were
extracted by three parameterizations of a relational function for texture analysis, considering a
surrounding region, in order to capture high level concepts from the image. These vectors were
concatenated and, for all images, clustered by a k-means algorithm, with the number of clusters
set empirically. All feature vectors of each image were accumulated into a global feature vector
using the 1-NN cluster center in three steps: first, an all-invariant accumulator with 20 bins was
created by simply counting the number of cluster occurrences for each image; second, a
rotation invariant accumulator with 10 bins was made for all pairs of salient points lying within
a specific distance range and sharing identical cluster indices; third, an orientation invariance
accumulator with 4 bins, using a co-occurrence matrix to capture the statistical properties of
the joint distribution of cluster membership indices, was built for all pairs of salient points. The
total feature vector for one image totaled 16000 elements. The best run, of the two submitted,
used 1000 salient points per image and, for classification, a one-vs-rest multiclass SVM with a
histogram intersection kernel and empirical parameterization.

MedIC-CismEF – LITIS Laboratory – INSA de Rouen, CISMeF Team, Rouen University
Hospital & GCSIS Medical School of Rouen, Rouen, France:

The MedIC-CismEF team's method started by resizing all images to 256 × 256 pixels. After
splitting each image into 16 equal blocks (64 × 64 pixels), global and local features were
extracted. These features include: 4 co-occurrence matrices, one for each of 4 different
directions, after 64 gray-level quantization, producing a 16-element feature vector; the fractal
dimension, a single number between 2 and 3 denoting texture smoothness; Gabor features for 3
scales and 4 orientations, generating a 24-element feature vector; a 3 × 3 discrete cosine
transform with the exclusion of the low frequency component, resulting in an 8-element feature
vector; gray-level run lengths in different directions, yielding a 14-element feature vector; Laws
texture energy masks, for a 28-element feature vector; the Multispectral Simultaneous
Autoregressive Model (MSAR), where the gray level of a pixel is expressed as a function of the
gray levels in its neighborhood, for a 24-element feature vector; and gray level statistical
measures, first to fourth moments, as local features, resulting in a 7-element feature vector. The
total feature size is 122 for each block. Considering also the extraction of all features for the
scaled image as a whole, this results in a 2074-element feature vector per image. The best run
used an SVM with an RBF kernel, whose parameterization was tuned with cross-validation on
the training set, after a PCA over the feature space using 95% of the variance as reference,
yielding a reduced feature vector of 335 elements. The error rate was 17.2%, achieving third
place in the task.
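Assuming a feature matrix X with one 2074-element row per training image and a label vector y, the reduction and classification stage could be sketched with present-day scikit-learn as follows; the hyperparameter grid is illustrative, not the team's:

from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# PCA keeping 95% of the variance, followed by an RBF-kernel SVM whose
# parameters are tuned by cross-validation, as reported by the team.
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),
    ('svm', SVC(kernel='rbf')),
])
search = GridSearchCV(pipeline,
                      {'svm__C': [1, 10, 100],
                       'svm__gamma': ['scale', 0.01, 0.001]},
                      cv=5)
# search.fit(X_train, y_train); predicted = search.predict(X_test)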

MSRA – Microsoft Research Asia, Beijing, China:

The MSRA team scored an error rate of 17.6%, with the best method relying only on global
image descriptor information together with an SVM for classification. Three main global
features were extracted: images were divided into 8 × 8 blocks and normalized average gray
levels of each block were extracted, for illumination invariance; images were again divided,
into 4 × 4 blocks, and wavelet coefficients of each block at different scales were computed, for
texture information; and images were converted to binary using Otsu's method for the
threshold, the area and central point of the resulting regions were calculated, and morphological
operations were then performed to extract the contour and edges of the image to describe its
shape. These features were duplicated 6 times to increase their importance in the SVM. An
RBF kernel was chosen for this classifier and parameters were tuned by cross-validation using
5 folds. A similar method using local descriptors was also attempted, with slightly worse
results.

RWTH-mi – Department of Medical Informatics, RWTH University, Aachen, Germany:

The Department of Medical Informatics from RWTH University returned this year with
exactly the same method from the 2005 Medical Annotation Task, only with the parameters
adjusted for the newly expanded database. Classification yielded an error rate of 21.5%, a worse
performance than the one in 2005.

CINDI – CINDI group, Concordia University, Montreal, Canada:

The CINDI team presented a different method for this competition. Instead of an SVM
whose input was the fusion of different image features, a pairwise combination of SVMs, each
using one different image descriptor, was used. MPEG-7 image descriptors, the Edge
Histogram Descriptor (EHD) and the Color Layout Descriptor (CLD), were extracted as global
features. From 5 overlapping grid image divisions, invariant shape moments (first and second)
and the GLCM were also extracted and combined into a semi-global descriptor. Scaled images,
of size 64 × 64 pixels, were also used and, from these, the mean gray level of 4 × 4 pixel blocks
was calculated and concatenated. The summation rule for the combination of classifiers
delivered the best results, with an error rate of 24.1%.

OHSU – Department of Medical Informatics & Clinical Epidemiology, Oregon Health
& Science University, Portland, OR, USA:

The OHSU team achieved a best error rate of 26.3%. Four runs were submitted using
different image descriptors. The best run used gray values from 16 × 16 image thumbnails in a
32-bin histogram. Classification was performed using a neural network with a multi-layer
perceptron architecture, optimized with the training set of images.

NCTU – Department of Computer & Information Science, DBLAB, National Chiao
Tung University, Hsinchu, Taiwan:

No working notes from the NCTU team could be found but, besides some image descriptors
used in the 2005 Medical Annotation Task (not the most successful ones), Gabor texture
features and a coherence moment, a corrected vector layout was also added. These three image
descriptors together with a NN classifier yielded an error rate of 26.7%.

MU – Media Understanding Group, Institute for Infocomm Research (I2R), Singapore:

The MU team used a two-stage SVM classification for the test set annotation using different
features from those used in the 2005 Medical Annotation Task. A 16 × 16 map of salient
regions denoting the conspicuity of the image was computed, forming a 256-element feature
vector. From multi-resolution wavelets a set of salient points was detected in the image. Then,
13 × 13 image patches around the top 50 salient points were extracted and the respective gray
level values were turned into feature vectors. SIFT was also used to extract multi-directional
features in 4 × 4 patches around keypoints detected with a Difference of Gaussians (DoG).
Finally, histograms of pixel gray values from left-tilted strips were also constructed.

In the first stage of the classification an SVM with an RBF kernel was trained using
cross-validation for parameter tuning and feature selection. From the training set, the classes
containing 95% of the wrongly classified samples were marked for a more refined second
evaluation stage. This refinement process, using the training set, defines a threshold used on the
test set. To evaluate a "bad" image in the training set the three maximal elements of the distance
vector from a reference basic system are considered. If a criterion involving the first and second
maximum values is met then the image is reclassified. Otherwise, if a criterion involving the
second and third maximum values is verified then the image is declared as "good". The error
rate was 28.0% for this methodology.


ULG – Institut Montefiore, University of Liège, Liège, Belgium:

The ULG team submitted runs based on the same methods explored in the 2005 Medical
Annotation Task. Although these methods achieved a very good performance in 2005, for
2006, and using 20 extremely randomized trees, the classification results were less good, with
an error rate of 29.0%.

DEU – Dokuz Eylul University, Tinaztepe, Turkey:

The DEU team scored an error rate of 29.5% using the EHD as image descriptor and a 3-NN
classifier. Only one run was submitted.

UTD – The University of Texas at Dallas, Dallas, TX, USA:

The UTD team achieved the last place in the task with an error rate of 31.7%. The only run
submitted scaled all images to 16 × 16 thumbnails and then performed PCA over the pixel gray
level values. A weighted k-NN was used for classification.

Discussion for the 2006 Medical Annotation Task:

The second year of this task saw, with the exception of the best results, an overall
improvement in the error rates, even with the increased complexity of the problem. The top
results are only separated by a maximum of 1.5% in error difference. In a second group the
error rate varies between 21.5% and 31.7%.

The increase in the number of classes made the RWTH-i6 and RWTH-mi runs based on the
IDM underperform. The ULG run based on tree boosting was also not as successful as in the
2005 Medical Annotation Task. SVMs undoubtedly dominated the classifier preferences, with
9 teams relying on them for the proposed task. One of the most interesting approaches is
probably the MU team's SVM classifier, because it is the only one that tries to implement
active learning by automatically identifying and reclassifying suspicious classes. A fusion of
the top three results using majority voting led to an improved error rate of 14.4%. In general,
image recognition and detection techniques seem well suited for automatic annotation.

2.3.3 2007 Medical Annotation Task

In the 2007 Medical Annotation Task [49] a database of 11000 annotated training images
was provided. For the first time the complete IRMA code was used for annotation. Thus, runs
were evaluated according to the error evaluation scheme (see section 2.2) and not the error rate.
1000 test images were made available for this task. Regarding a full IRMA code as a class of
objects, a total of 116 image classes existed. A total of 29 teams registered but only 10
submitted their work, in a total of 68 runs. Table 2.4 shows the overall results for this year's
task.

Rank    Team        Error Score    Error Rate (%)
1       BLOOM       26.8           10.3
2       RWTH-i6     30.9           13.2
3       UFR         31.4           12.1
4       RWTH-mi     51.3           20.0
5       UNIBAS      58.1           22.4
6       OHSU        67.8           22.7
7       BIOMOD      73.8           22.9
8       CYU         79.3           25.3
9       MIRACLE     158.8          50.3
10      GENEVA      375.7          99.7

Table 2.4 – Rankings for the 2007 ImageCLEF Medical Annotation Task (From: [49]).

BLOOM – IDIAP Research Institute, Martigny, Switzerland:

The BLOOM team scored the lowest error count of the 2007 competition, 26.8. A BoW was
built using SIFT as image descriptor (at one octave only1), computed over a dense point
sampling covering both training and test sets rather than with a keypoint detector. The visual
dictionary consisted of 500 words/concepts for the full image and its 2 × 2 partitions. Each
dictionary was built with the aid of the K-means algorithm using a Euclidean metric. From the
5 dictionaries considered, 1500 sampling points were extracted. Furthermore, images were
resized to 32 × 32 thumbnails and the pixel gray values were also used.

For classification two methods, the Discriminative Accumulation Scheme (DAS) and the
Multi Cue Kernel (MCK), were used. While the DAS is a high level integration approach,
classifying the same set using different image descriptors and fusing the results in the end, the
MCK is a mid-level integration approach using a multiclass SVM with a linear combination of
two exponential chi-square kernels, one for each image descriptor. Cross-validation on 5
disjoint subsets of the training set provided the best weights for the kernel combination. The
best run was achieved using the MCK in a one-vs-all SVM.

1 The context of "octave" in Lowe's work, the SIFT descriptor, is similar to image scaling. The reason for
the difference between terms is that Lowe uses scale to denote different levels of smoothness resulting
from the convolution between the image and a Gaussian function.
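A simplified sketch of a dense-sampling BoW pipeline like BLOOM's is given below, using OpenCV's stock SIFT and scikit-learn's K-means; the separate dictionaries for the 2 × 2 partitions and the chi-square SVM stage are omitted, and the grid step is an assumption:

import cv2
import numpy as np
from sklearn.cluster import KMeans

def dense_sift(img, step=8):
    """SIFT descriptors on a regular grid (dense sampling), instead of
    a keypoint detector; img is a uint8 grayscale array."""
    sift = cv2.SIFT_create()
    kps = [cv2.KeyPoint(float(x), float(y), float(step))
           for y in range(step, img.shape[0] - step, step)
           for x in range(step, img.shape[1] - step, step)]
    _, desc = sift.compute(img, kps)
    return desc

def build_vocabulary(images, n_words=500):
    """Cluster all descriptors into a dictionary of visual words."""
    all_desc = np.vstack([dense_sift(im) for im in images])
    return KMeans(n_clusters=n_words, n_init=4).fit(all_desc)

def bow_histogram(img, vocab):
    """Normalized histogram of visual word occurrences for one image."""
    words = vocab.predict(dense_sift(img))
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()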


RWTH-i6 – Computer Science Department, RWTH University, Aachen, Germany:

The RWTH-i6 team presented 4 runs based on their previous work on sparse histograms of
image patches used in the 2006 Medical Annotation Task. Histograms of 65536 or 4096 bins
were created and classification was performed with SVMs, taking into consideration the
complete IRMA code or each axis separately. In the end the results of the 4 runs were combined
with majority voting. The error count was 30.9.

UFR – LMB Group, Albert-Ludwigs-University, Freiburg, Germany:

The UFR team scored an error count of 31.4, repeating their methods based upon the
relational features used in the 2006 Medical Annotation Task. After the extraction of these
features, classification was performed with SVMs using the full IRMA code and axis-wise. A
Binary Classification Tree (BCT), using the dot product between SVM hyperplanes as a
similarity measure, was also used but did not perform very well.

RWTH-mi – Department of Medical Informatics, RWTH University, Aachen, Germany:

The RWTH-mi team once again submitted a run similar to those used in the 2005 and 2006
Medical Annotation Tasks, optimized for the current image set and using the full IRMA code as
a class. Three runs were submitted, differing only in the way the 1-NN predicted codes are
assembled to predict the final code. The error count achieved was 51.3; later, using 5-NN
predicted codes, the error count worsened while the error rate improved.

UNIBAS – Databases and Information Systems group, University of Basel, Basel,
Switzerland:

The first participation of the UNIBAS team resulted in an error count of 58.1. The main
focus of the UNIBAS runs, 7 in total, was to speed up image annotation using a more generic
IDM for 3 × 3 and 5 × 5 windows without degrading the quality of the results. Inspired mainly
by the previous run using the IDM during the 2005 Medical Annotation Task, two layers were
used in the model: in one, images were resized to X × 32, where X is the smallest edge of the
full image; in the other, the information retrieved from the image consisted of the application of
a Sobel filter on the full image, with downscaling performed afterwards. Different weights were
applied to both layers, the gray level values being more relevant than the filtered thumbnails. A
series of model configurations and algorithmic considerations were made to speed up the IDM.
A weighted k-NN classifier, using the inverse Euclidean distance, was then applied. The closest
3-NNs were compared in order to classify suspicious positions along the axis with the "do not
know" option.


OHSU – Department of Medical Informatics & Clinical Epidemiology, Oregon Health
& Science University, Portland, OR, USA:

The OHSU team ranked 6th, scoring an error count of 67.8. Images were first scaled to a
size of 256 × 256 pixels. Then information from every image was gathered using: the GLCM;
the gray level co-occurrence matrices of five 128 × 128 overlapping blocks from the image
(GLCM2); a global discrete cosine transform; a gray level histogram with 32 bins; and 16 × 16
thumbnails from the scaled images. All the previous features were then again extracted from
the thumbnails. Separately, the spatial envelope (GIST) was also used. A neural network with a
multi-layer perceptron architecture was the classifier for the annotation, with the number of
hidden nodes optimized on the training data. The best run used the histogram of the thumbnails
and 300 hidden nodes for the neural network. The score achieved is very similar to that of the
previous year's task. A neural network classifier, also with 300 hidden nodes, for the GIST
descriptor produced similar, but not better, results.

BIOMOD – Bioinformatics and Modeling group, University of Liège, Liège, Belgium:

BIOMOD returned for the 2007 task, again using randomized trees with boosting, and
scored a 73.8 error count. The method does not differ much from the previous year's,
considering, like other approaches to the same task, a unique full IRMA code as a single class
as well as axis-wise classifications, with both results combined afterwards. However, the
combination of methods and the axis-wise approach did not outperform the full code
classification method.

CYU – Information Management AI lab, Ching Yun University, Jhongli City, Taiwan:

The CYU team submitted only one run, for an error count of 79.3. For image features this
team proposed an illumination invariant relative local measure for neighboring pixels.
According to this feature, every pixel is classified into one of three categories. By dividing the
image into 4 identical blocks and evaluating the occurrence frequency of pixels within the same
category, the spatial information therein is gathered. A signature of the image represented by a
vector of 324 elements is then created. A NN rule using an author-defined metric tuned with the
training set does the annotation.

MIRACLE – Universidad Politecnica de Madrid, Universidad Carlos III de Madrid,
DAEDALUS S.A., Madrid, Spain:

The MIRACLE team used the FIRE framework to extract several global features from the
images: for histogram-like information, gray pixel value histograms and Tamura texture
features; for vector-like information, the global aspect ratio, a global texture descriptor and
Gabor features. A total of 30 approaches using a 10-NN classifier, comprehending one specific
type of features or all together, classified the test set using the full IRMA code, axis-wise and
pairwise axes, for normalized and non-normalized features. The run for all features
(normalized) with an axis-wise strategy provided the best results, with an error count of 158.8.

GENEVA – University and University Hospitals of Geneva, Service of Medical
Informatics, Geneva, Switzerland:

The GENEVA team ranked last in this task with an error count of 375.7. The error rate was
particularly bad, with only around 30 completely well classified images. As in the 2005
Medical Annotation Task, the GIFT framework was used to extract image descriptors and
annotate the test set. The amount of features extracted was extensive: local color features at
different scales considering image-block partitions; color histograms; a quantization of the GTF
into 10 strengths; Gabor filters and the aspect ratio. All features were weighted and several
k-NN classifiers for k = 1, ..., 20 were tested. The best final results took k = 5 into
consideration. This work was largely improved and explored after the task.

Discussion for the 2007 Medical Annotation Task:

From Table 2.4 it is clearly seen that the results can be clustered into three main groups: the
first three results; ranks 4-8; and the last two ranks. One of the major conclusions that can be
drawn from this task is that SVMs outperform NN classifiers for image classification. All three
top teams use them for annotation. Furthermore, these runs were combined using majority
voting for an error count of 24 and an error rate of 10.3%. Only 54 wildcards on 31 images
were placed during the combination. For image features another important conclusion can be
reached: local image features provide better results than global ones. The teams ranking from
1st to 5th place use local features alone or combinations of these with global features. Only
OHSU (6th place) uses purely global descriptors. Most of the classification strategies adopted
ignored the hierarchical structure of the IRMA code and made little or no use of wildcards. The
usage of the full IRMA code as one class of objects, or of each IRMA code axis as a class
considering its specific meaning, dominated the preferences, with the first performing better
than the second. Deselaers et al. [49] also report that no image was completely misclassified by
all submitted runs and only one image was completely well classified by them all. However,
such a fact does not bring new insight into the evaluation of the problem complexity since the
GENEVA team had a 99.7% error rate. A conclusion of the review for this task states that "the
task is now at the point where it can be applied directly to images being inserted into a medical
picture archiving system" [49]. However, this is not true of the runs provided by the BLOOM
team, since their BoW dictionary was assembled taking into consideration features extracted
from the test set, thus improving the identification of visual concepts during the clustering
process. Even if no previous annotation is needed for this technique, which is an acceptable
strategy for the task, it finds no applicability in PACS.

2.3.4 2008 Medical Annotation Task

For this year's task [50] the training database was increased to 12076 images, for a total of
196 unique IRMA codes. The test set was again set to 1000 images for annotation. The extra 80
classes, when compared with the 2007 task, posed a very high challenge to the participants.
This is probably the reason why, out of 37 registrations, only 6 groups submitted runs, for a
total of 23 runs.

Rank    Team        Error Count
1       IDIAP       74.92
2       TAU         105.75
3       RWTH-mi     182.77
4       MIRACLE     187.90
5       GENEVA      209.70
6       FEIT        286.48

Table 2.5 – Results for the 2008 ImageCLEF Medical Annotation Task considering the best
run of each group (From: [50]).

IDIAP – IDIAP Research Institute, Martigny, Switzerland:

The IDIAP team achieved, for the second consecutive year, first place in the task, with an
error score of 74.92. The best of the 10 runs submitted by this team consisted of a low level
feature integration of the previous 2007 Medical Annotation Task's modified SIFT descriptor,
at one scale and ignoring rotation, considering the full image and 2 × 2 partitions for a
2500-element feature vector, together with the rotationally invariant Local Binary Pattern
(LBP) operator. The idea behind the LBP is to extract textural information from the region
around a pixel after performing a binarization according to a threshold. As an image feature,
the concatenation of two two-dimensional LBP histograms was considered, leading to a vector
of 648 elements.

Automatic annotation was performed with an SVM using an exponential chi-squared kernel
after a simple concatenation of features. Virtual image examples of low-represented classes
were added to the training set by performing SIFT-invariant transformations. Among these
were small rotations of up to 40 degrees, shifts of 50 pixels in several directions, increments of
50-100 pixels in scaling, and illumination variations. The best parameters for the SVM kernel
were optimized using subsets of the training set, taking into consideration the class abundance
therein. Once the annotation was performed, a confidence-based opinion fusion followed: the
attained distance to the hyperplane is subtracted from the maximum distance of a class to the
hyperplane and, if the difference is less than a defined threshold, a wildcard is placed. Other
methods considering the MCK were also used but did not perform better. The method used in
the 2007 Medical Annotation Task was also submitted, with poor results.

TAU – Medical Image Processing Lab, Tel Aviv University, Tel Aviv, Israel:

The TAU team achieved second place in their first participation, with a score of 105.75. A
visual vocabulary of 700 words was built from a collection of 9 × 9 rectangular patches,
separated by 6 pixels, from 400 randomly selected images. The covariance matrix of
approximately 2 million patches (around 2500 patches per image) was computed and PCA was
applied to find its eigenvectors. The 6 highest energy eigenvectors were used as a basis for the
rest of the patches. All patches extracted from one image were normalized to have 0 mean and
1 variance, thus providing some invariance to illumination. The patch mean gray level is lost
during the PCA and was later added as one more feature; moreover, the spatial location of the
patches was also used, further extending the feature vector. The Euclidean distance was used
for the clustering of all features extracted from the 400 random images to build the dictionary.

For image annotation a multiclass one-vs-one SVM with an RBF kernel was trained directly
on the histograms. The hierarchy was ignored and every IRMA code was considered as a class
of objects. However, another method by this team, using the SVM's probability output, used
wildcards in the annotation process.

RWTH-mi – Department of Medical Informatics, RWTH University, Aachen, Germany:

The RWTH-mi team submitted the same run as in 2005 and achieved an error count of
182.77. This run is a baseline for evaluating the evolution of the other methods over the several
years in which the Medical Annotation Task took place. The hierarchical nature of the IRMA
code was again disregarded.

MIRACLE – Universidad Politecnica de Madrid, Universidad Carlos III de Madrid,
DAEDALUS S.A., Madrid, Spain:

The MIRACLE team, with a score of 187.90, achieved fourth place in this task. The FIRE
engine was abandoned this year and the features extracted comprised a gray level histogram,
statistics of several orders, Gabor features (4 scales, 8 orientations), co-occurrence matrix
statistics, DCT coefficients, Tamura features and Discrete Wavelet Transform (DWT)
coefficients from images resized to 256 × 256 pixels. The same features were also extracted
from 64 × 64 blocks of the resized images. For annotation the classifier used contained two
blocks: the first selected the images from the training set whose distance to the feature vector
corresponding to a test image was under a defined threshold; the second generated the IRMA
code depending on the codes and similarity of the chosen images, assigning a '*' wherever the
strings disagree. This is, indeed, a variation of the k-NN algorithm. The best results were
achieved for k=3. Relevance feedback methods were also applied but led to slightly worse
results.
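The second block, merging the codes of the selected neighbors position by position and falling back to '*' on disagreement, is simple enough to sketch directly (the distance-threshold filtering of the first block is assumed to have happened already):

def assemble_code(neighbor_codes):
    """Merge the IRMA codes of the nearest neighbors character by
    character, writing '*' at every position where they disagree."""
    merged = []
    for chars in zip(*neighbor_codes):
        merged.append(chars[0] if len(set(chars)) == 1 else '*')
    return ''.join(merged)

# The hypothetical 3-NN codes below agree on all but two positions:
print(assemble_code(['1121-120-800-700',
                     '1121-120-810-700',
                     '1121-127-800-700']))   # -> 1121-12*-8*0-700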

GENEVA – University and University Hospitals of Geneva, Service of Medical
Informatics, Geneva, Switzerland:

The GENEVA team achieved 5th place in the task with an error count of 209.70. The best
method used GIFT, as in previous tasks, to perform image annotation. The only changes were
in the parameterization settings and the classification strategies adopted. Image features were
similar to those used in the 2007 Medical Annotation Task. The annotation, however, now took
into consideration the bias introduced in the k-NN voting by the unbalanced number of images
per class. This strategy did not produce better results than a simple axis-wise descending voting
strategy for k=5. Later, different thresholds were tested and it was verified that a letter-by-letter
voting performs best; the meaning of such a threshold, however, is not very clear.

FEIT – Faculty of Electrical Engineering and Information Technologies, University of
Skopje, Skopje, Macedonia:

The FEIT team from the University of Skopje placed last in this task with a score of 286.48.
Only one descriptor, the EHD, was used to extract information from the images. These were
divided into 16 × 16 blocks and, for each, edge histograms corresponding to 5 different
orientations (vertical, horizontal, 45 degrees, 135 degrees and non-directional edges) were
concatenated into a vector of 80 elements. For classification an axis-wise strategy was adopted
using top-down induction decision trees, random forests with bagging, taking into
consideration the maximization of the reduction of variance for a better cluster homogeneity.
An ensemble of 4 training and 4 test sets, for each axis, of 100 unpruned trees was created, and
the feature subset size for the random forests was set to 7. As examples travel through a tree, a
threshold is defined for the Euclidean distance between the variances of the two nodes.


Discussion for the 2008 Medical Annotation Task:

As expected, the 2008 Medical Annotation Task led to a better use of the hierarchy, with the
best runs, except those from TAU and RWTH-mi, using wildcards during the process. The
motivation for this was the increase in image classes from 116 to 196, which posed a challenge
for classifiers because these rely deeply on examples to perform annotation. The image classes
where wildcards were used most often (sometimes with 8 or more wildcards) were the less
represented ones in the training set. The amount of wildcards used over all runs ranges from
nearly 1000 to 7000 [50]. The winning run, from the IDIAP team, uses 4148 wildcards.

Results for this task vary much more than in the previous years' tasks. Only the IDIAP and
TAU teams achieved error counts below the baseline (the RWTH-mi run); nevertheless, the
difference between them is quite large. Other methods achieved scores close to the baseline
error. Of these, perhaps MIRACLE saw the major improvement when compared with previous
years. As in the 2007 Medical Annotation Task, SVMs and local descriptors based on dense
sampling (IDIAP team) or image patches (TAU team) together with a BoW outperformed all
other classifiers and descriptors.

2.3.5 2009 Medical Annotation Task

This was the last time the Medical Annotation Task took place; no further editions are
planned for the near future. An overall task, comprehending a survey of all the tasks from
previous years, was proposed [45]. A total of 12677 images were made available for training
and 1733 images for testing. Images corresponding to three IRMA codes existing in previous
years, 1121-120-450-700, 1121-120-700-400 and 1121-490-913-700, were discarded. Teams
were allowed to submit only one annotation method for all years. Nevertheless, small variations
within the same method, like, for example, different parameterizations, were allowed.

Error evaluation suffered some changes due to the mixture of data from different years in the
same test set. Moreover, the '*' classification option was also introduced for the 2005 and 2006
annotations. For the first two years a correctly classified code yields no error while a
misclassified one holds an error of 1.0; for the '*', half of the maximum error, 0.5, is given.
Table 2.6 gives a small example of this error counting scheme.

Classified    Error Count
18            0.0
26            1.0
*             0.5

Table 2.6 – Error count scheme considering the '*' for 2005 and 2006 data (From: [45])


The usage of the same test data for multiple tasks introduced the concept of a clutter class.
Because not all classes are considered in the evaluation for a specific year, a clutter class is
assigned to all classes that have no expression therein. Hence, any annotation for a clutter class
is not subject to error. Table 2.7 exemplifies the error score for a clutter class.

Classified (2005-2006)    Error Count
18                        0.0
26                        0.0
*                         0.0
C                         0.0

Classified (2007-2008)    Error Count
111                       0.000000
11*                       0.000000
1**                       0.000000
***                       0.000000
*C*                       0.000000

Table 2.7 – Error count scheme considering the clutter class for the 2005-2008 tasks (From: [45])

The details about the number of classes per task, for training and annotation, are depicted in
Table 2.8. It is worth remembering that, despite the distribution of data being similar for 2006
and 2007, the annotation was performed without and with the IRMA code, respectively. When
the IRMA code is not used for annotation (2005 and 2006), each numerical class can contain
several types of images corresponding to distinct IRMA codes.

Data Distribution for the 2009 Medical Annotation Task

Training    2005     2006     2007     2008
Classes     57       116      116      193
Images      12631    12334    12334    12677
Clutter     46       343      343      –

Test        2005     2006     2007     2008
Classes     55       109      109      169
Images      1639     1353     1353     1733
Clutter     94       380      380      –

Table 2.8 – Data distribution for the 2009 Medical Annotation Task (From: [45])

This task was particularly extensive and all the teams that had participated in previous years
were invited to do so again. Only 7 teams submitted runs, 19 in total. Table 2.9 shows the final
standings for the participants of this task.


                          Error Count
Rank    Team         2005    2006    2007      2008      Sum
1       TAUBiomed    356     263     64.3      169.5     852.8
2       IDIAP        393     260     67.23     178.93    899.16
3       FEITIJS      549     433     128.1     242.46    1352.56
4       VPA          578     462     155.05    261.16    1456.21
5       MedGIFT      618     507     190.73    317.53    1633.26
6       IRMA         790     638     207.55    359.29    1994.84
7       DEU          1368    1183    487.5     642.5     3681

Table 2.9 – Final standings for the 2009 Medical Annotation Task (From: [45])

TAUBiomed – Medical Image Processing Lab, Tel Aviv University, Tel Aviv, Israel:

TAUBiomed won the task with an error sum of 852.8. Information extraction from the images was performed in the same way as in the 2008 Medical Annotation Task and, for classification, a multiclass SVM was used: an extensive grid search was conducted to optimize the parameters of a χ² kernel using 5-fold cross-validation, taking into consideration the error count rather than the error rate. The IRMA code hierarchy was not used and, in the end, all ‘0’ positions were replaced by wildcards (a sketch is given below). If the ‘0’ code was correct, this strategy implies no additional error; moreover, if the ‘0’ was wrongly placed in a last position, the error count is reduced by half. Model training and annotation took approximately 90 minutes.
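A minimal sketch of this wildcard post-processing is given below; the helper name and the example code string are hypothetical, serving only to illustrate the substitution.

```python
def zeros_to_wildcards(irma_code: str) -> str:
    """Replace every '0' (unspecified) position of a predicted IRMA code
    by the '*' wildcard: if the '0' would have been correct the wildcard
    costs nothing extra, and if it was wrong in a final position the
    hierarchical error is only half of the full penalty."""
    return "".join("*" if c == "0" else c for c in irma_code)

print(zeros_to_wildcards("1121-120-200-700"))  # -> 1121-12*-2**-7**
```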

IDIAP – IDIAP Research Institute, Martigny, Switzerland:

The IDIAP team achieved 2nd place with a sum score of 899.16 and the lowest error count for the 2006 IRMA database. The method used is exactly the same as in the 2008 Medical Annotation Task.

FEITIJS – Faculty of Electrical Engineering and Information Technologies, University

of Skopje, Skopje, Macedonia:

The FEITIJS team ranked 3rd with a sum score of 1352.56. The approach was very similar to the one used during the 2008 Medical Annotation Task. Besides the EHD descriptor, SIFT was also used in a three-stage process: first, the keypoints and the corresponding descriptors were extracted; second, all keypoints were clustered into 2000 clusters; third, a histogram of 2000 bins was created by assigning each keypoint to its closest cluster. This is, indeed, a BoW with 2000 words (a sketch is given below). This histogram was concatenated with the EHD descriptor and decision trees were once again used to perform annotation, following exactly the same approach as in the 2008 Medical Annotation Task but with a feature subset size of 11.
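The three-stage BoW construction can be sketched as follows, assuming the SIFT descriptors were already extracted by an external tool; scikit-learn's KMeans stands in for whatever clustering implementation the team actually used.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_bow(descriptors_per_image, n_words=2000, seed=0):
    """descriptors_per_image: list of (n_keypoints_i, 128) SIFT arrays.
    Returns one n_words-bin histogram per image plus the vocabulary."""
    all_desc = np.vstack(descriptors_per_image)
    # Stage 2: quantize the keypoint space into a visual vocabulary.
    vocab = KMeans(n_clusters=n_words, random_state=seed).fit(all_desc)
    # Stage 3: each keypoint votes for its closest cluster centre.
    hists = [np.bincount(vocab.predict(d), minlength=n_words)
             for d in descriptors_per_image]
    return np.array(hists), vocab
```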

VPA – Computer Vision and Pattern Analysis Laboratory, Sabanci University, Istanbul,

Turkey:

The VPA team ranked 4th with an error sum of 1456.21. Images were divided into 4 × 4 non-overlapping blocks and from each a derived LBP histogram was extracted; these were concatenated into a single, spatially enhanced histogram. This derivation of the LBP aimed to capture edges, spots and flat areas over the image. The feature vector totaled 944 elements and a normalization to [−1, +1] was performed before submission to the classifier.

For the annotation task a multiclass one-vs-all SVM with an RBF kernel was chosen. Parameter configuration was done empirically (trial and error) using 5 disjoint subsets of 2000 images, considering the minimum average error rate (maximum accuracy). Of the 3 strategies evaluated, the axis-wise one performed best: one SVM was trained for each code axis and the final code was predicted from the composition of the results from each model. The other 2 strategies, ignoring the hierarchy and training SVMs according to the abundance of classes in the training set, performed slightly worse.

MedGIFT – University and University Hospitals of Geneva, Service of Medical

Informatics, Geneva, Switzerland:

MedGIFT scored an error sum of 1633.26 using the GIFT retrieval system. The method was similar to those of the 2007 and 2008 Medical Annotation Tasks, with a 5-NN using GIFT with 8 gray levels performing consistently better for all years. An SVM approach using SIFT was also attempted but performed poorly due to a poor configuration of the RBF kernel parameters.

IRMA – Department of Medical Informatics, RWTH University, Aachen, Germany:

The IRMA team, with a sum score of 1994.84, was placed 6th in the rankings with the usual baseline run. As in previous years, the hierarchy was disregarded.

DEU – Dokuz Eylul University, Tinaztepe, Turkey:

The DEU team ranked last in this task with a sum score of 3681. Local and global descriptors were used with a k-NN algorithm for classification.


Discussion for the 2009 Medical Annotation Task and final remarks:

The methods used in the 2009 Medical Annotation Task were based on the best methods from previous years. Therefore, the consistency of the error count rankings in each individual year is no surprise. Local image descriptors and support vector machines were still the best options for image annotation. The TAUBiomed team surpassed the IDIAP team in all dataset years with the exception of 2006, a curious result given how large the difference was in 2008. The replacement of ending ‘0’ positions by a wildcard may have contributed to a better error count; however, this strategy admits two interpretations. On one hand, the meaning of ‘0’, unspecified, can be seen as possessing the same semantic meaning as a wildcard, ‘do not know’: by assigning ‘0’ or a wildcard, the information about a specific position is null. On the other hand, a slightly different interpretation understands ‘0’ as unspecified because the position is surely not one of the other possible choices, i.e., a rejection option, and the wildcard as ‘do not know’ because there is not enough confidence to assign a specific classification. The extensive grid search should also not be excluded from the reasons for such a boost in performance. Local descriptors helped the FEITIJS team improve in the rankings and achieve the best error sum for a non-SVM classifier approach, similarly to 2005. Curiously, the VPA team's method also used one of the image descriptors that performed best in 2007 and 2008 but, with a different SVM kernel, it performed poorly.

From 2005 until 2009 it is remarkable to see the variety of strategies comprising image descriptors and classifiers. However, there are still many aspects to consider in this problem. For instance, the IRMA database consists only of X-rays, with the T-axis locked in some positions, but the IRMA code covers a wider range of modalities. Databases covering these other modalities and annotated with the IRMA code are still not available. Taking into consideration this larger spectrum of medical image types and their corresponding classes would require the whole problem to be readdressed. Also, many images from the 2007-2009 databases have several unspecified axes/positions in their IRMA code. Therefore, the success of the methods presented during the 2008-2009 task years cannot be evaluated to its full extent, i.e., in a setting where such incompleteness is absent. It is worth remembering that the IRMA code is still under development and the number of positions within an axis is expected to grow. This will have an impact on the error evaluation scheme.

As a last remark, we note that generally there were no particular criteria behind the image descriptors selected in the works reviewed, with many relying on feature extraction engines. Possibly this choice reflects the image descriptors the authors had already used in other works of their research. The only exception is the work from the RWTH University, home of the IRMA database.


2.4 Other IRMA database related work

In [51] a mammographic database annotated with the IRMA code is presented. This database contains more image direction and anatomical information than the mammographic images existing in the IRMA database, and it is a good example of the IRMA code evolution. Another IRMA database related work can be found in [52]. There, a subset of the IRMA database consisting of 9100 images was annotated as belonging to 40 distinct classes. The IRMA code is not used in this database, only two single annotations regarding anatomy and direction. The paper presents experimental results on a new merging scheme for medical image classification, where the 40 classes are merged into 25 hierarchically superior classes. For this, a large number of image descriptors are extracted from the images: shape features, like invariant moments, Fourier descriptors and axis orientation; texture features, namely of statistical origin, like energy, homogeneity, contrast and correlation; and, finally, a tessellation-based spectral feature as well as a directional histogram, both in multi-scale space. Feature selection is performed using a backward algorithm and classification, in a first stage, is carried out by a multi-layer perceptron. Thereafter, the merging scheme, an iterative procedure based on the estimation of a distribution function using a mixture of Gaussian functions and the expectation-maximization algorithm, is applied and the results are encompassed in the 25 hierarchical classes. The 9100-image dataset was divided into a training set of 7861 images and a test set of 1239 images. Ultimately, the merged class system had an accuracy of 90.83%.


Chapter 3

Background Information

3.1 The image domain

The use of digital images dates from 1920, when the Bartlane cable picture transmission service was used to transfer images between London and New York. These were codified in 5 gray levels (later 15) and reconstructed using a telegraph printer. The use of digital images as we know it today appeared in the 1960s, when improvements in computing technology and the onset of the space race led to a surge in digital image processing, especially in the enhancement of pictures of the moon taken by the Ranger and Apollo missions [53]. In the medical field the digital image appeared in the 1970s and its importance was recognized in 1979, when Sir Godfrey N. Hounsfield and Prof. Allan M. Cormack shared the Nobel Prize in Medicine for the development of computer-assisted tomography, the invention behind Computerized Axial Tomography. But what is an image? The image, in a literal definition, is a two-dimensional pictorial representation. The digital image is an approximation of a two-dimensional image by a set of values called pixels. Each pixel is described in terms of its color, intensity/luminance or value. Each digital image¹ has a limited extent, window or size, an outer scale, and a limited resolution, the inner scale.

Mathematically, the image is a real function $f: \mathbb{R}^2 \to \mathbb{R}$ mapping two real variables into a third real variable. Thus $f(\mathbf{x}) = f(x, y) = z$, where $(x, y)$ is the full spatial domain of the image, i.e., a set of points in the Cartesian plane, and $z$ is the luminance/color/value of the image point, with $\mathbf{x} \in \mathbb{R}^2$ and $z \in \mathbb{R}$. The value $z$ can be interpreted in many ways and is not necessarily positive. Also, depending on the color system used, $z$ may have a higher dimensionality. In the digital image, $\mathbf{x} = (x, y)$ are pixel coordinates, whose values are bounded by the image size $M \times N$, with $x \in \{1, 2, \ldots, M\}$ and $y \in \{1, 2, \ldots, N\}$. Therefore, the digital image can be seen as an $M \times N$ matrix of elements.

In order to retrieve images according to a given query we need to enhance their relevant elements while reducing the remaining aspects. This is the goal of image processing. Generically, we act on the image using an operator $\Gamma$ over the full spatial domain of the image, $f(x, y)$, an image patch, $f(u, v)$, or an interest point, $f(u_i, v_j)$, to generate a feature space containing the information needed to identify the objects, in the following way:

$F(\mathbf{x}) = \Gamma \circ f(x, y)$ (3.1)

$F(\mathbf{x}_{(u,v)}) = \Gamma \circ f(u, v)$ (3.2)

$F(\mathbf{x}_{(u_i,v_j)}) = \Gamma \circ f(u_i, v_j)$ (3.3)

where $f(x, y)$ is the full image; $f(u, v)$ is an image patch, i.e., a connected subset of Cartesian points with $(u, v) \subset (x, y)$ and $u, v \in \mathbb{R}$; and $f(u_i, v_j)$ is the $z$ value at an interest point $(u_i, v_j)$, with $i \in \{1, 2, \ldots, M\}$ and $j \in \{1, 2, \ldots, N\}$.

¹ For easy reference, a “digital image” will henceforth be referred to simply as “image”.

3.1.1 Image properties

In Chap. 1 it was stated that there is no clear definition of what image content is. Instead, relationships between image properties like color, shape, texture and interest points are certain to be fundamental for its characterization. But what exactly are these properties and how are they used for CBIR purposes?

3.1.1.1 Color

Natural Conceptualists say that color is the product of an epistemological conceptualization: an object is blue because we learn that it is blue and accept it as a “truth”. This is acceptable if we can explain why it is “truth”. We are therefore left with a difficult philosophical problem where there is no consensus on the origin of color. Two very distinct points of view exist: some theorists agree that color is a perceiver-relative property of objects, e.g., dispositions or powers to induce experiences of a certain kind, or to appear in certain ways to observers of a certain kind, while others state that colors are objective physical properties of objects, e.g., that colors rely on the physical microscopic properties of bodies and are, therefore, irreducible [54]. There are many theories about color: Color Fictionalists state that there are no colors at all (!), exploiting the gaps of other theories and thus supporting a perceiver-relative point of view; Simple Objectivists stand for the concept that color is either related to physical properties of the objects or to the nature of light, hoping that science will provide an answer; Ecologists (!) diverge slightly from Color Fictionalists, arguing that color is a relational property between the environment and the individual. There are other theories attempting a “unified” definition of color but so far an agreement seems impossible. For the interested reader, details of color theory can be found in [54].

In CBIR, color is a widely used visual feature to categorize objects. For this, the variable $z$ is expressed in terms of a color space representing the image colors. The RGB (Red, Green, Blue) system is commonly used to represent color images, where each color is expressed as a sum of red, green and blue intensities. There are many other color systems, like the HSV (Hue, Saturation, Value) or the CMYK (Cyan, Magenta, Yellow, Key). The image can also be represented as an 8-bit grayscale image, where pixel intensity is registered in terms of 256 shades of gray, or as a 1-bit binary image, in black and white. A quick reference to color systems and conversions between them can be found in [55].

Depending on the color system, one or more histograms are employed to quantify the color distribution, defined by the number of bins used. Differences in color distribution are sometimes essential to determine differences between images; however, such distributions can lead to errors when different images present similar histograms. Aiming to capture spatial relationships between colors, the image can be partitioned into smaller subimages with a color histogram extracted from each of these, resulting in the color layout of the image (a sketch is given below). Correlations between pairs of similar colors, based on their mutual distance within the image, can also be explored in what is called a color auto-correlogram.
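The sketch below illustrates the color layout idea on a gray-level image, the relevant case for the IRMA radiographs; the grid size and bin count are arbitrary choices.

```python
import numpy as np

def layout_histograms(img, grid=(4, 4), bins=16):
    """Split a 2-D gray-level image into grid[0] x grid[1] subimages and
    stack one normalized histogram per block, so that some spatial
    information survives the quantization."""
    bh, bw = img.shape[0] // grid[0], img.shape[1] // grid[1]
    feats = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            block = img[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            hist, _ = np.histogram(block, bins=bins, range=(0, 256))
            feats.append(hist / max(hist.sum(), 1))
    return np.concatenate(feats)  # grid[0] * grid[1] * bins values
```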

3.1.1.2 Shape

The shape of an object is its apparent form. In the image domain, extracting shape consists in the identification of lines and curves. Shape extraction is already a well-developed field in image processing, where two main streams exist: gradient-like methods, which look for directional maxima to quantify the edge strength, like the Canny edge detector and the Sobel, Prewitt and Roberts operators; and second-derivative zero-crossing search methods, like Laplacian-based approaches. Other methods based on the Hough transform, curve propagation and wavelets also exist. Quantification of edges is made using histograms considering the full image spatial domain or, as with color, subimages.

3.1.1.3 Texture

The concept of texture is somewhat intuitive, being closely related to visual patterns perceived on the surface of objects that present homogeneity. However, its definition is not exact. Coggins summarizes several definitions of texture in the literature [56], only to find that each is adjusted to the context of the works therein presented. Perhaps a good definition of texture for image processing is that it is a function of the spatial variation in pixel intensities [57].

In image processing, texture representation relies mainly on two kinds of methods: structural and statistical. Structural methods aim to identify textural structural primitives and the placement of their elements, looking for regularity; examples are adjacency graphs or morphological operations. Statistical methods use first- or higher-order statistics to analyze the distribution of luminance in the image; these include the popular co-occurrence matrix, Fourier power spectra or shift-invariant principal component analysis (SPCA). In CBIR, similarity metrics are the most used method to compare the textures of images during the retrieval process. A review of texture extraction methods can be found in [57].

3.1.1.4 Interest points

Interest points are themselves the result of an operation over the full spatial domain $(x, y)$. It is an image descriptor computed over an image patch around these points, rather than the points themselves, that plays a role in image retrieval. Therefore, equations (3.2) and (3.3) are related if the patch is centered on an interest point. Examples of interest point detectors are the Difference of Gaussians (DoG), the Harris corner detector or the Hessian matrix. In CBIR, the use of image descriptors around interest points is grounded in two methods: direct image matching, to find the same image in a collection, or a bag-of-words approach, to capture image concepts. A review of image descriptors using information around interest points can be found in [31].

3.1.2 Image descriptors

It is very difficult to ascertain which image properties are fundamental to characterize a specific image. It depends on the context of the problem we want to solve and on the knowledge within the image itself. If we want to recognize specific objects in a scene, the shape property is probably more relevant than the others; however, if such objects have a distinct color, then the relevance of this property is higher than the rest. If we want to detect light bulbs in a night scenario, we rely on an interest point detector. Sometimes the image can be quite complex and all properties are essential for its characterization.

Equations (3.1)-(3.3) enable the quantification of image properties, originating image features. Such quantification is an attempt to bridge the perception of image properties by the human vision mechanism to mathematical measure(s) taken from the image. This is the aim of an image descriptor. Image descriptors can be seen as global, like (3.1), or local, like (3.2) and (3.3), depending on whether the whole image spatial domain or only a restricted spatial domain is used.


An important aspect of information retrieval from an image is that such information should remain unchanged under different conditions. Thus, descriptors invariant to illumination conditions, distortion, clutter, occlusion or rotation provide valuable information for the retrieval process in CBIR.

In addition to image descriptors focusing on color, shape or texture alone, there are composite descriptors. Table 3.1 summarizes the image descriptors used in this work, their spatial domain and the image properties they cover.

Image descriptors can also be perception-based, if the computed image features are believed to be related to human perception, like the Tamura textures, or machine-centered, simply computing a series of statistical measures from the image that prove their value in specific problems through experimentation. Good examples of the latter are the Haralick features [58], including the co-occurrence matrix. Machine-centered image descriptors have the advantage of performing analysis on aspects not captured by human vision.

Descriptor                            Spatial Domain     Image Properties
                                      Global   Local     Color   Shape   Texture   Interest Points
Tamura Textures                         x                                  x
Edge Histogram (EHD)                    x                         x
Color Layout (CLD)                      x                 x
Scalable Color (SCD)                    x                 x
Color and Edge Directivity (CEDD)       x                 x       x        x
Fuzzy Color and Texture (FCTH)          x                 x       x        x
Spatial Envelope (GIST)                 x                         x
Speeded Up Robust Features (SURF)                x                                       x

Table 3.1 – Image descriptors used in this work, classified according to their spatial domain and the image properties covered. Low-level descriptors, like Tamura, EHD, CLD and SCD, involve only one image property. CEDD and FCTH are mid-level descriptors. SURF uses gradient information around interest points and can be seen as a mid-level descriptor; however, its use to construct a bag-of-words makes it a high-level descriptor.

Some of the descriptors used are compliant with the Moving Picture Experts Group MPEG-7 standard. MPEG-7 defines a Description Definition Language (DDL) for descriptor schemas in multimedia content. For images, an MPEG-7 descriptor should be compact, in order to be added as metadata, and have proven results for image retrieval. Incorporating a set of image features as part of a standard in the image file metadata can improve image retrieval systems, because the computation of such features is bypassed. With the exception of GIST and SURF, all descriptors presented here are MPEG-7 descriptors. More information about image descriptors compliant with the MPEG-7 standard can be found in [59].

In this work we selected image descriptors based on several criteria. First, we intend to capture image information with respect to the image properties described in the previous section. Second, availability: we enforced the use of image descriptors with code provided by their authors or available as part of an image information extraction engine. Third, we wanted to use recently proposed descriptors, CEDD and FCTH, with machine learning methods rather than similarity measures. As in the related works, we did not choose our image descriptors based on the nature of the images; most are general image descriptors used in a large variety of problems. Therefore, this work is also a test of their robustness.

3.1.2.1 Tamura Texture Features

Based on psychological experiments, Tamura [60] developed a series of computable features that correlate with the human perception of images: coarseness, contrast, directionality, line-likeness, regularity and roughness.

Coarseness

Coarseness is computed from considerable spatial variations of gray levels. It is related, although implicitly, to the size of the primitive textural structures (texels) present in the image. In a first step, $n$ averages over $2^k \times 2^k$ image sub-windows (Figure 3.1), where $k \in \{0, 1, \ldots, n\}$, are computed around the central pixel:

$A_k(u, v) = \sum_{i=u-2^{k-1}}^{u+2^{k-1}-1} \ \sum_{j=v-2^{k-1}}^{v+2^{k-1}-1} f(i, j) / 2^{2k}$ (3.4)

Then the absolute differences between pairs of non-overlapping averages on opposing sides, in both the horizontal and vertical directions, are calculated:

$E_k^h(u, v) = \left| A_k(u - 2^{k-1}, v) - A_k(u + 2^{k-1}, v) \right|$ (3.5)

$E_k^v(u, v) = \left| A_k(u, v - 2^{k-1}) - A_k(u, v + 2^{k-1}) \right|$ (3.6)


Figure 3.1 – $2^k \times 2^k$ image sub-windows for coarseness extraction (From [60]).

Now considering the maximum difference found,

$\arg\max_{k, d} \, E_k^d(u, v)$ (3.7)

where $k \in \{0, 1, \ldots, n\}$ and $d \in \{h, v\}$, its corresponding scale $S_{opt}(u, v) = 2^k$ is used to compute the coarseness over the entire image:

$F_{crs} = \frac{1}{M \times N} \sum_{i} \sum_{j} S_{opt}(i, j)$ (3.8)

with $M \times N$ the image size.

Contrast

Contrast measures how gray levels vary in the image and the extent to which their distribution is biased towards white or black. If an image has low contrast, its histogram is expected to be a Gaussian function; if it is Gaussian, then it is unimodal. Polarization gives a measure of the number of peaks a distribution has and can be estimated by the kurtosis

$\alpha_4 = \frac{\mu_4}{\sigma^4}$ (3.9)

where

$\mu_4 = E\left[ (f(u, v) - \mu)^4 \right]$ (3.10)

$\sigma^4 = \left( E\left[ (f(u, v) - \mu)^2 \right] \right)^2$ (3.11)

are the 4th moment about the mean and the squared variance, respectively.


If the distribution is platykurtic (negative kurtosis), the histogram of the image has several peaks, i.e., it is not Gaussian; if the histogram is unimodal, a leptokurtic (positive kurtosis) distribution is expected. The contrast is then defined as

$F_{con} = \frac{\sigma}{(\alpha_4)^{1/4}}$ (3.12)
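A direct translation of (3.9)-(3.12) could look as follows; this is a plain restatement of the formulas rather than a reference implementation.

```python
import numpy as np

def tamura_contrast(img):
    """Contrast as defined by (3.9)-(3.12): sigma / alpha4^(1/4), with
    alpha4 the kurtosis mu4 / sigma^4 of the gray-level distribution."""
    x = img.astype(float).ravel()
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()          # variance
    mu4 = ((x - mu) ** 4).mean()             # 4th moment about the mean
    alpha4 = mu4 / sigma2 ** 2               # kurtosis, eq. (3.9)
    return np.sqrt(sigma2) / alpha4 ** 0.25  # F_con, eq. (3.12)
```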

Directionality

To compute the directionality of an image, a histogram $H_D$ of local edges at different orientations is constructed using a Sobel filter. This histogram is expected to be uniform for images without a strong orientation and to exhibit peaks for images with high directionality. Estimating the sharpness of the peaks in the histogram, by summing the second moments around each of them, gives the measure of directionality:

$F_{dir} = 1 - r \cdot n_p \cdot \sum_{p}^{n_p} \sum_{\phi \in w_p} (\phi - \phi_p)^2 \cdot H_D(\phi)$ (3.13)

where $r$ is a normalization factor related to the quantization of the angles, $n_p$ is the number of peaks, $w_p$ is the set of bins around peak $p$ and $\phi_p$ is the position of peak $p$ in $H_D$. All summations are stored in a 16-bin histogram.

Line-likeness, Regularity and Roughness

The last Tamura features are related to the previously described ones, thus adding little in terms of textural discriminative power. Line-likeness is the average coincidence of edge directions that co-occur in pairs of pixels at a distance $d$ along the edge direction; it is measured using the cosine difference between the angles.

The regularity is defined as

$F_{reg} = 1 - r \, (\sigma_{coarseness} + \sigma_{contrast} + \sigma_{directionality} + \sigma_{line\text{-}likeness})$ (3.14)

where $r$ is the normalization factor stated before and $\sigma$ is the standard deviation of the measure stated in each subscript. Roughness is the sum of coarseness and contrast.

3.1.2.2 Edge Histogram Descriptor (EHD)

The EHD captures the spatial distribution of edges over the image in five orientations: horizontal, vertical, 45°, 135° and non-directional (Figure 3.2).


Figure 3.2 – Types of edges in the Edge Histogram Descriptor (From [61]).

Images are divided into blocks of $2^n \times 2^n$ pixels, usually $16 \times 16$ pixels, and, for each sub-image, one or more digital filters (Figure 3.3) are applied to detect the edges along the mentioned orientations.

Figure 3.3 – Example of a commonly used edge detector operator (From [61]).

Quantification of the number of edges for each sub-image takes into consideration its maximum magnitude. Consider the matrix operators in Figure 3.3. Let $m_v(i, j)$, $m_h(i, j)$, $m_{d45}(i, j)$, $m_{d135}(i, j)$ and $m_{nd}(i, j)$ be the magnitudes of the vertical, horizontal, 45-degree, 135-degree and non-directional edges of the $(i, j)$ image block, and let $f_v(k)$, $f_h(k)$, $f_{d45}(k)$, $f_{d135}(k)$ and $f_{nd}(k)$ be their respective filter coefficients. Then the magnitudes of each block can be computed as:

$m_v(i, j) = \left| \sum_{k=0}^{3} a_k(i, j) \times f_v(k) \right|$ (3.15)

$m_h(i, j) = \left| \sum_{k=0}^{3} a_k(i, j) \times f_h(k) \right|$ (3.16)

$m_{d45}(i, j) = \left| \sum_{k=0}^{3} a_k(i, j) \times f_{d45}(k) \right|$ (3.17)

$m_{d135}(i, j) = \left| \sum_{k=0}^{3} a_k(i, j) \times f_{d135}(k) \right|$ (3.18)

$m_{nd}(i, j) = \left| \sum_{k=0}^{3} a_k(i, j) \times f_{nd}(k) \right|$ (3.19)

where $a_k(i, j)$ are the pixel intensity values of the $(i, j)$ image block. Given a magnitude threshold, a histogram of edges is constructed for each block, so the descriptor is a vector of 80 elements. Furthermore, dividing the number of occurrences in each bin by the total number of blocks normalizes the feature vector. A general flowchart for the extraction of the descriptor can be found below in Figure 3.4, followed by a small sketch of the per-block filtering.

Figure 3.4 – Flowchart for the Edge Histogram Descriptor.
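The per-block filtering of (3.15)-(3.19) can be sketched as below; the 2 x 2 filter coefficients follow commonly cited MPEG-7 references and the threshold value is an arbitrary placeholder.

```python
import numpy as np

# 2x2 filters for the five EHD edge orientations (cf. Figure 3.3).
FILTERS = {
    "vertical":        np.array([[1.0, -1.0], [1.0, -1.0]]),
    "horizontal":      np.array([[1.0, 1.0], [-1.0, -1.0]]),
    "diagonal_45":     np.array([[np.sqrt(2), 0.0], [0.0, -np.sqrt(2)]]),
    "diagonal_135":    np.array([[0.0, np.sqrt(2)], [-np.sqrt(2), 0.0]]),
    "non_directional": np.array([[2.0, -2.0], [-2.0, 2.0]]),
}

def block_edge_type(block2x2, threshold=11.0):
    """Apply (3.15)-(3.19) to one 2x2 macro-block of mean intensities and
    return the winning edge bin, or None when no magnitude exceeds the
    threshold (the 'no edge' case)."""
    mags = {name: abs((block2x2 * f).sum()) for name, f in FILTERS.items()}
    best = max(mags, key=mags.get)
    return best if mags[best] >= threshold else None
```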

3.1.2.3 Color Layout Descriptor (CLD)

The CLD [59] specifies the spatial distribution of colors. In a first stage, images are divided into 8 × 8 = 64 blocks and the dominant color of each block is extracted to build an image of size 8 × 8. The method most commonly used is the color average, but any other method can be an option during this second phase (Figure 3.5).


Figure 3.5 – Flowchart for the Color Layout Descriptor.

In a third stage, each of the three (Y, Cb, Cr) color space components of the 8 × 8 image is transformed by a Discrete Cosine Transform (DCT), from which three sets of coefficients are obtained. In the end, a non-linear quantization following a zigzag scan of the image is used to weight these coefficients, producing the feature vector.

3.1.2.4 Scalable Color Descriptor (SCD)

The SCD [59] is a histogram of colors in the HSV color space. In a first step, the Hue (H) component is quantized into a 16-bin histogram while the Saturation (S) and Value (V) components are quantized into 4-bin histograms (Figure 3.6). Afterwards, a series of 1-D Haar wavelets are applied to these histograms, generating 16 low-pass and 240 high-pass coefficients. Some high-pass coefficients can be discarded, as they consist of low positive and negative values arising from redundant information in the original histograms. By doing this, the total length of the descriptor vector can be reduced to 128, 64 or 32 bins, or to 16 bins if we discard the high-pass coefficients completely.

Figure 3.6 – Flowchart of the Scalable Color Descriptor.


3.1.2.5 Color and Edge Directivity Descriptor (CEDD)

The Color and Edge Directivity Descriptor (CEDD) is a recently proposed composite image descriptor [62] that captures and relates shape, texture and color from an image (Figure 3.7).

Figure 3.7 – Flowchart of the Color and Edge Directivity Descriptor (From [62]).

In this descriptor we can consider the full image or an image block. The texture unit receives the input block in the YIQ color space and applies the EHD descriptor to construct a histogram of 6 bins: five corresponding to the types of edges found in the image (Figure 3.2), plus one for when no edge of any type is found. However, the EHD is computed in a different way here, because the magnitudes of the edges are normalized:

$m_v = \frac{M_v}{Max}, \quad m_h = \frac{M_h}{Max}, \quad m_{d45} = \frac{M_{d45}}{Max}, \quad m_{d135} = \frac{M_{d135}}{Max}, \quad m_{nd} = \frac{M_{nd}}{Max}$ (3.20)

where $Max = \max(M_v, M_h, M_{d45}, M_{d135}, M_{nd})$ is the maximum edge magnitude.

Then, given a threshold, an edge may fall into more than one of the five directional bins; this determines its texture. If the edge does not fall into any edge category, it belongs to the last bin, corresponding to no edges.

For color, each input block is processed in the HSV color space according to the types of edges found previously. The first step is to map each edge block into a preset 10-bin color histogram (Black, Gray, White, Red, Orange, Yellow, Green, Cyan, Blue and Magenta), using a Binary Haar Wavelet descriptor and a method of 20 fuzzy-linking rules [62]. In a second stage this histogram is expanded into a 24-bin color histogram by using Coordinate Logic Filters (CLF) for vertical edge detection in all three HSV channels: Hue is divided into 8 areas (Red to Orange, Orange, Yellow, Green, Cyan, Blue, Magenta and Blue to Red); Saturation is divided into two fuzzy regions defining the shade of a color with respect to white; and the Value channel is divided into three areas, one defining when the pixel (block) will be black and the other two, in combination with Saturation, when it will be gray. Based on this area division, a set of 4 fuzzy-like rules is applied, transforming the previous 10 colors into a 24-bin color histogram comprising Black, Gray, White, Dark Red, Red, Light Red, Dark Orange, Orange, Light Orange, Dark Yellow, Yellow, Light Yellow, Dark Green, Green, Light Green, Dark Cyan, Cyan, Light Cyan, Dark Blue, Blue, Light Blue, Dark Magenta, Magenta and Light Magenta.

Color information processed for every edge type yields a 6 × 24 = 144-bin histogram. Ignoring the last component of the color unit leads to a 6 × 10 = 60-bin histogram named the Compact Color and Edge Directivity Descriptor (CCEDD). To meet MPEG-7 definitions, this feature vector undergoes a Gustafson-Kessel classifier to map the final histogram bin values from decimal to integer.

3.1.2.6 Fuzzy Color and Texture Histogram (FCTH)

The Fuzzy Color and Texture Histogram (FCTH) is also a recent composite descriptor [63] that resembles the CEDD, aiming to capture the image texture, shape and color (Figure 3.8).

Figure 3.8 – Flowchart of the Fuzzy Color and Texture Histogram (From [63]).

In the texture unit, a one-level Haar transform is applied to the luminosity (from the YIQ color space) of an input block, resulting in four frequency bands, each containing $2 \times 2$ coefficients. For example, considering a $4 \times 4$ image block, the HL band coefficients are $\{c_{k,l}, c_{k,l+1}, c_{k+1,l}, c_{k+1,l+1}\}$. From these, one feature is computed as:

$f_{HL} = \sqrt{\frac{1}{4} \sum_{i=k}^{k+1} \sum_{j=l}^{l+1} c_{i,j}^2}$ (3.21)

The features for the LH and HH bands are computed similarly. These features are moments of wavelet coefficients and are effective in discerning the image texture, because coefficients in


different direction bands signal variations in different directions. For instance, the HL band discriminates activity in the horizontal direction: vertical edges show high energy in this band and low energy in the LH band. The features computed by (3.21) then undergo a fuzzy system which shapes an 8-bin histogram representing several areas: Low Energy Linear Area; Low Energy Horizontal activation; Low Energy Vertical activation; Low Energy Horizontal and Vertical activation; High Energy Linear Area; High Energy Horizontal activation; High Energy Vertical activation; High Energy Horizontal and Vertical activation. A sketch of the band-energy computation of (3.21) is given below.
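The sketch below computes the one-level Haar bands of a square luminance block and the band energies of (3.21); note that the HL/LH naming convention varies across references, so it is fixed here by assumption.

```python
import numpy as np

def haar_band_energies(block):
    """One-level 2-D Haar transform of an even-sized luminance block,
    returning the energy sqrt(mean(c^2)) of the HL, LH and HH bands."""
    a = block.astype(float)
    lo = (a[:, 0::2] + a[:, 1::2]) / 2.0   # horizontal low-pass
    hi = (a[:, 0::2] - a[:, 1::2]) / 2.0   # horizontal high-pass
    hl = (hi[0::2] + hi[1::2]) / 2.0       # high horizontal, low vertical
    lh = (lo[0::2] - lo[1::2]) / 2.0       # low horizontal, high vertical
    hh = (hi[0::2] - hi[1::2]) / 2.0       # high in both directions
    energy = lambda band: float(np.sqrt((band ** 2).mean()))
    return energy(hl), energy(lh), energy(hh)
```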

The remaining procedure for the color unit is similar to that of the CEDD descriptor and takes into consideration each of the areas computed in the texture unit. The FCTH descriptor therefore results in an 8 × 24 = 192-bin histogram. To meet MPEG-7 definitions, this feature vector also undergoes a Gustafson-Kessel classifier to convert the bin values from decimal to integer. The compact version of the FCTH descriptor (CFCTH) disregards the 24-bin fuzzy-linking block in the color unit, yielding a histogram of 8 × 10 = 80 bins.

3.1.2.7 Spatial Envelope (GIST)

The Spatial Envelope (GIST) is an image descriptor developed for natural scene categorization. Influenced by seminal approaches in computational vision that depicted visual processing as a hierarchical organization of modules of increasing complexity (edges, surfaces, objects), one prominent view of scene recognition is based on the idea that a scene is built as a collection of objects [64]. GIST instead processes the scene as a single entity, aiming for a representation of its shape. This means that scenes belonging to the same category have a similar shape or spatial structure. Since medical images are not likely to possess any particular objects, we found that this image descriptor could provide us with valuable information. Indeed, in [49] the GIST is one of the image descriptors used on the 2007 Medical Annotation Task database.

In [64], 5 spatial envelope properties are considered: naturalness, openness, roughness, expansion and ruggedness. However, this choice reflects the natural scene database used, containing landscapes with man-made objects (buildings, roads) or natural landscapes (trees, rivers). For this reason the GIST is seen as an intermediate-level knowledge descriptor. Computation of the GIST descriptor is based on the spatial distribution of spectral information by means of a Windowed Discrete Fourier Transform (WDFT):

$I(x, y, f_x, f_y) = \sum_{x', y'} i(x', y') \, h_r(x' - x, y' - y) \, e^{-j 2\pi (f_x x' + f_y y')}$ (3.22)


where $i(x, y)$ is the intensity distribution of the image window along the spatial variables $(x, y)$, $f_x$ and $f_y$ are the spatial frequency variables and $h_r(x', y')$ is a Hamming window with circular support of radius $r$.

The localized energy spectrum (spectrogram), along a number of pre-determined directions, is then computed as:

$A^2(x, y, f_x, f_y) = |I(x, y, f_x, f_y)|^2$ (3.23)

It gives the distribution of the signal's energy among the different spatial frequencies, providing localized structure information. The size of the generated feature vector depends on the window size and on the directions intended.
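A crude numerical analogue of (3.22)-(3.23) is sketched below; the actual GIST additionally pools the spectral energy over orientation and scale bands, which is omitted here for brevity.

```python
import numpy as np

def local_energy_spectra(img, win=32):
    """Slide a Hamming window over the image, take the 2-D DFT of each
    window and keep the squared magnitudes as localized spectral energy."""
    w2d = np.outer(np.hamming(win), np.hamming(win))
    h, w = img.shape
    spectra = []
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            patch = img[y:y + win, x:x + win] * w2d
            spectra.append(np.abs(np.fft.fft2(patch)) ** 2)
    return np.array(spectra)  # one energy spectrum per window position
```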

3.1.2.8 Speeded Up Robust Features (SURF)

The SURF [65] is an invariant interest point descriptor for finding correspondences between two images of the same scene or objects. The motivation for the development of the SURF was to speed up this generic correspondence process, which is slow in another well-known, previously mentioned interest point descriptor, the SIFT. The methodology consists in considering the integral image, i.e., the running sum of the gray-level pixel values, together with second-order Haar wavelets, as an intermediate image representation. Interest points are then located using the Hessian matrix:

$\mathcal{H} = \begin{bmatrix} L_{xx} & L_{xy} \\ L_{xy} & L_{yy} \end{bmatrix}$ (3.24)

where $L_{xx}(u, v, \sigma)$ is the convolution of the Gaussian second-order derivative with the image, the basis of the Laplacian of Gaussian (LoG). In SURF, the second-order derivatives are approximated with box filters (mean/average filters), depicted in Figure 3.9. By changing the weights of the filter we increase or decrease the sensitivity of the detector.

Figure 3.9 – Box filters used as an approximation of the Gaussian second-order derivatives (From [65]).


The scale space analysis is performed with a constant image size during feature extraction. Given the scale s at which an interest point is detected, a circular neighbourhood of radius 6s is considered around this point. Then, to represent the descriptor directionality, the Haar wavelet responses in x and y are computed and represented as vectors, and all responses within an angle of 60° centered on the vector direction are summed.

The region around the point is then split into 4 × 4 square sub-regions with 5 × 5 regularly spaced sample points inside. For each sub-region the Haar wavelet responses in x and y are computed and weighted with a Gaussian kernel centered on the interest point. Summing the responses of each sub-region separately yields a feature vector of size 32. Information on the polarity of the intensity changes is then added by extracting the sums of the absolute values of the responses, originating a 64-element feature vector. Both vectors are then concatenated and normalized.

Computation of the SURF with a feature vector of 128 elements involves computing the sums of the x and |x| responses separately for y < 0 and y ≥ 0, and likewise for the y and |y| responses, thus doubling the length of the 64-element descriptor.

3.2 The Support Vector Machine (SVM)χ

Support Vector Machines (SVMs) are applied to the problem of making predictions based on previously seen examples, in what is called inductive inference. In order to understand what an SVM is, we first need to consider what we are aiming for when we talk about classifier performance. Because we hope that the algorithm predicts the correct labels of previously unlabelled data, it is natural to measure the performance of a classifier by the probability of misclassification of an unseen example. The problem is that, to establish such a probability, we would need to know the true underlying probability distributions of the data we are dealing with. If we actually knew these, there would be no need for inductive inference: knowledge of the true probability distributions allows us to calculate the theoretically best decision rule, corresponding to a Bayesian classifier. Perhaps a good way to learn the probability of misclassification is to use real data for which the class labels are known, comparing these with the ones predicted by the learning algorithm. This misclassification probability can be estimated by running the learning algorithm on disjoint subsets of our real data, in what is called cross-validation or, more specifically, n-fold cross-validation, where n is the number of subsets used. This is the key idea of a learning algorithm.

The SVM is a supervised learning algorithm that receives labeled examples as input and outputs a mathematical function used to predict the labels of new examples. Given the space from which the examples are taken, there are infinitely many hyperplanes, or linear functions, that can

χ Most contents in this section can be found in [66].


separate two distinct classes (Figure 3.10). The main idea behind the SVM is to know which separating hyperplane is optimal.

Figure 3.10 – Possible separating hyperplanes for labeled examples in their space representation.

In the case of a linearly separable problem, the SVM generates a decision function $h(\mathbf{x})$ that receives an example as input and outputs a label:

$h(\mathbf{x}) = \text{sign}(f(\mathbf{x}))$ (3.25)

where $f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b$, with $\mathbf{w}$ a weight vector and $b$ a scalar. The inner product $\langle \mathbf{w}, \mathbf{x} \rangle$ is defined as:

$\langle \mathbf{w}, \mathbf{x} \rangle = \sum_{i=1}^{d} w_i x_i$ (3.26)

where $d$ is the dimensionality and $w_i$ is the $i$-th element of $\mathbf{w} = (w_1, w_2, \ldots, w_d)$.

We can then formalize the problem addressed by the linear SVM: given a training set of vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ with corresponding class membership labels $y_1, y_2, \ldots, y_n$ that take on the values +1 or −1, choose the parameters $\mathbf{w}$ and $b$ of the linear decision function that generalizes well to unseen examples.

The decision rule for the choice of the best hyperplane is that it not only correctly separates the two classes in the training set, but also lies as far from the training examples as possible. Therefore, the search for such a hyperplane is an optimization problem. To solve it we need an objective function as well as a set of restrictions regarding the intended hyperplane (Figure 3.11).


Figure 3.11 – Optimal hyperplane (solid line) in a linearly separable classification problem (From [66]).

In order for our hyperplane to correctly separate the two classes we need two sets of constraints:

$\langle \mathbf{w}, \mathbf{x}_i \rangle + b > 0$, for all $y_i = +1$ (3.27)

$\langle \mathbf{w}, \mathbf{x}_i \rangle + b < 0$, for all $y_i = -1$ (3.28)

which can be combined as:

$y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) > 0, \quad i = 1, \ldots, n$ (3.29)

The constraints (3.27) and (3.28) mean that the data must be classified on the correct side of the hyperplane. However, they are not sufficient to separate the two classes optimally: we must do so with a maximum margin. The hyperplane satisfying $\langle \mathbf{w}, \mathbf{x} \rangle + b = 0$ in Figure 3.11 is the optimal hyperplane. The hyperplanes where $\langle \mathbf{w}, \mathbf{x} \rangle + b$ equals +1, in the upper right, and −1, in the lower left, are represented by the dashed lines. In order to maximize the margin, these two dashed hyperplanes must be equidistant from the optimal hyperplane and at the same time parallel to each other. This constraint can be written as:

$y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \geq 1, \quad i = 1, \ldots, n$ (3.30)


Now the margin can be maximized subject to this constraint. The margin width equals $2 / \sqrt{\langle \mathbf{w}, \mathbf{w} \rangle}$. Since maximizing $2 / \sqrt{\langle \mathbf{w}, \mathbf{w} \rangle}$ is the same as minimizing $\langle \mathbf{w}, \mathbf{w} \rangle$, we end up with the following optimization problem:

$\min_{\mathbf{w}, b} \ \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle \quad \text{subject to} \quad y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \geq 1, \ \text{for all} \ i = 1, \ldots, n$ (3.31)

However, situations may arise in which the data is not linearly separable (Figure 3.12). For these cases we need to soften the constraints, allowing some data to lie on the incorrect side of the +1 and −1 hyperplanes by means of a penalization.

Figure 3.12 – A linearly inseparable problem (From [66]).

We now introduce a parameter $C$ to balance the goals of margin maximization and correctness of the training set classification. Various tradeoffs between these goals are achieved by choosing $C$ using cross-validation on the training set. Our optimization problem becomes:

$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \frac{1}{2} \langle \mathbf{w}, \mathbf{w} \rangle + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) + \xi_i \geq 1, \ \xi_i \geq 0, \ \text{for all} \ i = 1, \ldots, n$ (3.32)


While in (3.31) the restrictions could not be violated at all, in (3.32) we look for solutions that keep the $\xi_i$ values small; we thus allow the point $\mathbf{x}_i$ to violate the margin by an amount $\xi_i$. The boundary points, or support vectors, play a significant role in the performance of the learning algorithm. The value $C$ trades off how large a margin we would prefer against how many of the training set examples violate this margin. This idea for linearly inseparable data extends to more complex situations (Figure 3.13).

Figure 3.13 – A more complex linearly inseparable case and its mapping into a linearly separable feature space (From [66]).

A linear classifier for the example in Figure 3.13 would never perform well. To overcome this we assume that there is a mapping $\Phi$ that transforms the initial data into a linearly separable feature space, possibly of higher dimensionality, and perform normal SVM classification in that space. If a reasonable margin can be achieved in the feature space, a good generalization of the problem can be expected. With the increase in dimensionality there was some fear regarding the curse of dimensionality, because it could be difficult to find a classifier that generalizes well if the number of examples is inferior to the dimension of the feature space. However, Vapnik [67] proved otherwise, opening the way for researchers to further explore methods that map data into high-dimensional spaces where maximum margin classifiers can perform.

Linear classifiers in high-dimensional spaces can take a considerable amount of time to solve the maximum margin optimization problem. A way to approach this difficulty is to convert the soft-margin SVM problem into an equivalent Lagrangian dual problem. If the problems are equivalent then the solutions must be the same. The new optimization problem becomes:


$\min_{\boldsymbol{\alpha}} \ \frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle - \sum_{i=1}^{n} \alpha_i \quad \text{subject to} \quad \sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq C, \ i = 1, \ldots, n$ (3.33)

where the $\alpha_i$'s are the dual variables of the problem, corresponding to the primal variables $\mathbf{w}$ and $b$ through:

$\mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i, \qquad \alpha_i \left( y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) - 1 \right) = 0$ (3.34)

Using the inner product rule $\langle \mathbf{a} + \mathbf{b}, \mathbf{c} \rangle = \langle \mathbf{a}, \mathbf{c} \rangle + \langle \mathbf{b}, \mathbf{c} \rangle$, we can write the decision function as:

$f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b = \sum_{i=1}^{n} \alpha_i y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b$ (3.35)

where the sign of $f(\mathbf{x})$ gives us the predicted label. In order to determine the optimal values of $\alpha_i$ and $b$ and to calculate $f(\mathbf{x})$, we do not need to know the training or testing vectors themselves, but only their inner products with one another. There is then no need to explicitly map the data from the initial space into a new feature space. What we need is a kernel function equal to the inner product of the mapped data:

$K(\mathbf{x}, \mathbf{y}) = \langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle$ (3.36)

The kernel function should be a good measure of similarity between the vectors $\mathbf{x}$ and $\mathbf{y}$ and has to satisfy a series of conditions known as Mercer's conditions. The kernel function $K(\mathbf{x}, \mathbf{y})$ satisfies Mercer's conditions if, for any square integrable function $h$, it is positive definite:

$\iint K(\mathbf{x}, \mathbf{y}) \, h(\mathbf{x}) \, h(\mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \geq 0$ (3.37)

Some of the most popular kernel functions are listed below (a usage sketch follows the list):

• Linear: $K(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y}$.
• Polynomial: $K(\mathbf{x}, \mathbf{y}) = (\gamma \mathbf{x}^T \mathbf{y} + r)^d, \ \gamma > 0$.
• Radial Basis Function (RBF): $K(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2), \ \gamma > 0$.
• Sigmoid: $K(\mathbf{x}, \mathbf{y}) = \tanh(\gamma \mathbf{x}^T \mathbf{y} + r)$.
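In practice, libraries such as LIBSVM or scikit-learn solve the dual problem (3.33) internally. The sketch below uses random stand-in data and shows how C and the RBF kernel width gamma are typically chosen by cross-validated grid search, mirroring the procedure used by several teams in Chapter 2; all concrete values are placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
X_test = rng.normal(size=(20, 10))

# Soft-margin SVM of (3.32): C trades margin width against training
# violations; gamma is the RBF kernel width from the list above.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
    cv=5,  # 5-fold cross-validation, as discussed earlier
)
grid.fit(X_train, y_train)  # solves the dual problem internally
y_pred = grid.best_estimator_.predict(X_test)
```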

The exact solution of the optimization problem requires quadratic programming. An SVM using quadratic programming can take advantage of sparse data during classification because in many cases the $\alpha_i$'s are equal to zero; the support vectors are those with $\alpha_i$ different from zero and correspond to the hard cases to decide.

With the basics of the SVM understood, we now quickly refer to two aspects used in this work. Notice that we only considered two possible classifications for the examples, −1 and +1. In problems where more than two labels exist (Figure 3.14), a multiclass problem, two approaches, one-vs-all or one-vs-one, are used.

Figure 3.14 – A multiclass classification problem.

Given a problem with data from $n$ classes, in the one-vs-all strategy we train $n$ classifiers, one for each class, by assigning the label +1 to the examples of that class and −1 to its complement. Then, given an unlabelled example, we apply each of the classifiers separately; the decision of a particular classifier does not influence the decisions of the others. To choose a particular class we rely on the maximum margin attained, by means of a maximum score, confidence value or probability (a sketch is given below). In the one-vs-one strategy we train, for each class, $(n-1)$ classifiers separately, each against the examples of a different class, totalling $n(n-1)/2$ classifiers. For an unlabelled example we choose the class that is selected by the most classifiers.
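A minimal one-vs-all wrapper might look as follows; real toolkits (e.g. scikit-learn's OneVsRestClassifier) already provide this, so the class is purely illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

class OneVsAllSVM:
    """One binary SVM per class; prediction goes to the class with the
    largest decision value (margin score)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = [LinearSVC().fit(X, (y == c).astype(int))
                        for c in self.classes_]
        return self

    def predict(self, X):
        scores = np.column_stack([m.decision_function(X)
                                  for m in self.models_])
        return self.classes_[scores.argmax(axis=1)]
```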


3.3 2007 Medical Annotation Task database

In Chapter 2 we presented the IRMA database. It was clear that, during the ImageCLEF Medical Annotation tasks, several subsets of this database were used. In this work we use the 2007 Medical Annotation Task database. This database comprises 12000 images belonging to 116 different classes. Initially it was separated into three different subsets: a training set (10000 images), a development set (1000 images) and a test set (1000 images). The training set was the first to be made available to the task participants. Following this release, the development set was made available for validation purposes. These two sets, for which the true classes are known, were merged to perform the analysis of the unlabelled test set. Some images from this database are depicted in Figure 3.15.

Figure 3.15 – Some examples from the ImageCLEF 2007 Medical Annotation Task database.

The database characteristics are similar to those of the main IRMA database. All images have gray-level values and are stored in the Portable Network Graphics (PNG) format. They are scaled proportionally to their original size and fitted into a 512×512 maximum pixel window. Each class has at least 10 images in the training set, and the number of images per class in the training set is uneven. The numbers of images per class in the training and test sets are proportional (Figure 3.16), and the number of images per class in the test set never exceeds the number in the training set. All images share equal T1, T2 and T3 positions in the technique axis.


Figure 3.16 – Frequency of classes in the ImageCLEF 2007 Medical Annotation Task database training and test sets. Two classes represented in the training set are absent from the test set.

We considered as training set the joint training and development sets made available. During our work we detected some images with several layers of repeated pixel intensities; we corrected these images before feature extraction.

Our goal is now clearer, as it is identical to the one proposed for the 2007 Medical Annotation Task: we will use the examples in the training set to train a model with the SVM in order to predict the correct labels of the test set. These labels consist of the IRMA code.



Chapter 4

Methodology

4.1 Framework description

Annotation of medical images in this work comprises two systems, each one

undergoing a number of stages. We named these Normal Fusion System (NFS) and

Smart Fusion System (SFS). Both systems are similar during the initial stages of the annotation

process and differ only regarding the fusion methods involved.

Figure 4.1 – Generic framework flowchart for the NFS and SFS systems. The two differ only in the methods fusion block.

The framework (Figure 4.1) consists of several sequential processing blocks. In the initial stage we apply the image descriptors described in Chapter 3 to the database in order to extract information from the images. In a second stage, feature vectors from the training set are used to train SVM models based on three different annotation approaches. These models are then used for the annotation. Afterwards we fuse these initial annotations in order to further improve our results in a final annotation. From this fusion between methods we obtain the final IRMA code annotation for all images.



4.1.1 Feature extraction

MPEG-7 image features - Tamura textures, EHD, CLD, SCD, CCEDD and CFCTH - were extracted using a framework developed in C# from the Img(Rummager) feature extraction engine Dynamic-Link Library (DLL) files (http://savvash.blogspot.com/2008/06/imgrummager-in-now-available-for.html). GIST (http://people.csail.mit.edu/torralba/code/spatialenvelope/) and SURF (http://www.vision.ee.ethz.ch/~surf/download.html) were extracted with MATLAB (http://www.mathworks.com) code provided by their respective authors. Details of the feature extraction block can be seen in Figure 4.2.

Figure 4.2 – The feature extraction block. MPEG-7 global image features were extracted using the Img(Rummager) engine, while GIST was extracted using code provided by its authors. The only local image descriptor, SURF, was used to construct a dictionary of visual words.

The resulting image features from each image descriptor were concatenated in the following

order: {CLD, SCD, CCEDD, CFCTH, EHD, Tamura Textures, GIST, SURF}, resulting in a

single 954-element vector per image.

4.1.1.1 Global descriptors

Some MPEG-7 descriptors were not used exactly as described in Chapter 3. For the Tamura textures, line-likeness, regularity and roughness were disregarded, as they are functions of coarseness, contrast and directionality and therefore provide no new information about the image. Tamura textures thus resulted in an 18-element feature



vector. CEDD and FCTH were used in their compact forms, CCEDD and CFCTH, computed over the full image instead of image blocks. The remaining MPEG-7 image descriptors - EHD, SCD and CLD - were used according to their definitions.

For the GIST descriptor all images were resized to 256 × 256 pixels. The feature vector was obtained from 64 × 64 non-overlapping sub-windows in 8 different directions, yielding a feature vector of 256 values.

4.1.1.2 Bag-of-words model

We used SURF together with a bag-of-words (BoW) model [32]. This model aims to create a dictionary of visual terms, representing image concepts, based on local image features. Given the dictionary of visual terms, the local image features of an image are quantized into a histogram of visual terms. Unlike the previously mentioned image descriptors, with the exception of GIST, the BoW model tries to capture high-level features from the image instead of low-level content. The idea behind the bag-of-words model is similar to the creation of dictionaries for text retrieval.

The creation of a BoW model starts with the extraction of local features from an image dataset. These undergo a clustering algorithm in which they are grouped according to a metric. The number of clusters/centers is user-defined: few clusters originate a small dictionary, in which different visual concepts may be represented by the same visual word, while too many clusters may create visual words which do not represent a visual concept. The exact dimension of the visual vocabulary is somewhat database-dependent, requiring exhaustive testing to reach a value that achieves the desired result/performance.

Once the dictionary of visual terms is created, for every image we assign each local image descriptor to a particular visual word if this is its nearest neighbor. Let $x \in \mathbb{R}^d$ be a local image descriptor with dimension $d$. We define its nearest neighbor as

$$w^{*}(x) = \{\, w_i \in D \mid \forall\, w \in D,\ w \neq w_i : \operatorname{dist}(x, w_i) \leq \operatorname{dist}(x, w) \,\} \qquad (4.1)$$

where $D$ is our visual dictionary. This yields a frequency histogram of visual terms for the image, with a number of bins equal to the visual dictionary size.

In this work we built a BoW model with 512 visual words from the SURF local image features. At the same time, it was our goal to build a sparse frequency histogram of visual terms, with a density of no more than 1 word per bin. The reason for this choice is that not only is it intuitively preferable to compare frequency histograms that possess different words rather than the same words in different quantities, but it also takes advantage of the SVM quadratic solver. However, we noticed that some images had a large number of interest points, sometimes more than 1000, while others possessed very few, as low as 3 or 5. To overcome this


issue, we dynamically modified the sensitivity of the interest point detector in the SURF code so that between 256 and 512 interest points are detected per image.

To build the visual dictionary we uniformly selected 30 local descriptors per image in the training set. The reason for this uniform selection is to avoid region-based sampling, since in the output descriptor file the SURF features are ordered according to their (x, y) coordinates. A total of 11000 × 30 = 330000 points was gathered and clustered into 512 centers with a k-means algorithm (http://www.cs.cmu.edu/~dpelleg/kmeans.html) using the Euclidean distance. Afterwards, the frequency histograms were assembled using (4.1) for both the training and test sets.
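A minimal sketch of this pipeline, assuming scikit-learn's k-means in place of the tool cited above and random arrays standing in for real SURF descriptors:

    # Illustrative BoW pipeline: cluster sampled local descriptors into 512
    # visual words, then quantize an image's descriptors with rule (4.1).
    import numpy as np
    from sklearn.cluster import KMeans

    def build_dictionary(sampled_descriptors, n_words=512, seed=0):
        # sampled_descriptors: (n_points, 64) array of SURF descriptors.
        km = KMeans(n_clusters=n_words, random_state=seed, n_init=4)
        km.fit(sampled_descriptors)
        return km.cluster_centers_  # the visual dictionary D

    def bow_histogram(descriptors, dictionary):
        # Assign each descriptor to its nearest visual word (Euclidean
        # distance) and count the assignments per word, as in (4.1).
        d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
        nearest = d2.argmin(axis=1)
        return np.bincount(nearest, minlength=len(dictionary))

    rng = np.random.default_rng(0)  # toy stand-in for real SURF output
    D = build_dictionary(rng.normal(size=(5000, 64)), n_words=512)
    h = bow_histogram(rng.normal(size=(300, 64)), D)  # 512-bin histogram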

4.1.2 Model training and image annotation

For the annotation task we relied on SVMs with a Radial Basis Function (RBF) kernel. We set up a framework in MATLAB using the LIBSVM [68] multi-class implementation with probability estimates, considering three approaches: flat, axis-wise and position-wise annotation. The flat annotation disregards the structure of the IRMA code completely by considering each full code a single class of objects; here, each IRMA code was replaced by an integer number. The axis-wise approach consists of annotating each IRMA code axis separately; the final IRMA code is assembled from the independent result of each axis. These two approaches were the most commonly used strategies in work related to the IRMA database (see Chapter 2). However, both disregard the hierarchical nature of the IRMA code. For this reason we decided to further explore that hierarchy by introducing the position-wise approach.

The position-wise method operates on each axis's code separately. The algorithm is as follows (a sketch is given after the discussion below):

1. Isolate the highest hierarchical position of the axis, its root, and use the whole training set to perform the initial annotation.

2. Group all previously unlabelled examples sharing the same annotated code.

3. For each group, reduce the training examples to those images that match the annotation given, in a semantic reduction of the training set, and train new SVM models to classify the hierarchically subsequent position.

4. Repeat this top-down process through the axis tree until it is completely classified, returning to step 2 at the current hierarchical level.

We undertake the same methodology for all axes and assemble the final IRMA code. As we

move along the tree, more groups will be created based on the ongoing annotation and the data

is systematically reduced.



It is clear that wrong decisions in early stages of the annotation process will result in a completely misclassified axis. However, images correctly annotated in early stages are expected to remain correctly annotated in the subsequent stages, since data that could induce an error is discarded during the semantic reduction.
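A minimal sketch of this top-down procedure for a single axis (illustrative helper names; the models in this work are LIBSVM RBF-SVMs with the parameterization discussed below):

    # Position-wise annotation of one axis: classify a position, group test
    # images by the prefix obtained so far, semantically reduce the training
    # set to each prefix, and descend to the next position.
    from sklearn.svm import SVC

    def annotate_axis(train_feats, train_codes, test_feats, depth):
        preds = [""] * len(test_feats)  # prefix annotated so far per test image
        for level in range(depth):
            groups = {}
            for i, p in enumerate(preds):  # step 2: group by current prefix
                groups.setdefault(p, []).append(i)
            for prefix, idxs in groups.items():
                # Step 3: semantic reduction of the training set.
                keep = [j for j, c in enumerate(train_codes)
                        if c.startswith(prefix)]
                labels = [train_codes[j][level] for j in keep]
                if len(set(labels)) == 1:  # only one possible digit remains
                    for i in idxs:
                        preds[i] += labels[0]
                    continue
                clf = SVC(kernel="rbf", probability=True)
                clf.fit([train_feats[j] for j in keep], labels)
                out = clf.predict([test_feats[i] for i in idxs])
                for i, digit in zip(idxs, out):
                    preds[i] += digit
        return preds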

For SVM model training we performed an extensive grid search for the flat and axis-wise strategies to optimize the RBF kernel parameters (γ, C) using 10-fold cross-validation. Because LIBSVM normalizes all input data, this search was conducted around 1/k, where k is the dimension of the feature vector. For γ we considered the interval between the orders of magnitude immediately greater and immediately smaller than 1/k, for a total of 9 values, and for the cost we considered C = 2^n, where n ∈ {-4, -3, …, 7}. A total of 108 models was thus trained for each axis in the axis-wise method, as well as for the flat strategy. Parameter optimization for the RBF kernel in the position-wise approach raised a problem at this point. For the Technique (T) and Biology (B) axes the axis-wise parameterization is suitable, because in the first there is only one position to annotate and in the second the annotation of the first position determines the values of the subsequent positions. Some preliminary experiments were done on a grid search for the first position of the Direction (D) and Anatomy (A) axes. However, in these axes the grid search returned a large set of optimal parameters, and we found it cumbersome to pursue a more accurate result here. We noticed, though, that the best RBF kernel parameterization for the axis-wise approach was among these results. Therefore, even if not optimal, we used the best axis-wise RBF kernel parameterization for the position-wise approach.

The cross-validation output during SVM training consists of the overall accuracy for each pair of parameters. Aside from the accuracy, we also wanted to see the error count (2.3) that subsets of the training data yield for the best RBF kernel parameters under all approaches. For this we divided the training set into 11 randomly selected disjoint subsets, each with 1000 images, and annotated each one separately with SVM models trained on the remaining 10000 images using the optimal parameters. The weights used were equal for all feature vector elements.
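A sketch of such a search, assuming scikit-learn's GridSearchCV as a stand-in for the LIBSVM scripts actually used:

    # Illustrative RBF-SVM grid search around gamma = 1/k with 10-fold CV.
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    k = 954  # dimension of the concatenated feature vector
    param_grid = {
        # 9 gamma values spanning the decade around 1/k
        "gamma": np.logspace(np.log10(1 / k) - 1, np.log10(1 / k) + 1, 9),
        # C = 2^n for n in {-4, ..., 7}: 12 values, hence 9 x 12 = 108 models
        "C": [2.0 ** n for n in range(-4, 8)],
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10,
                          scoring="accuracy")
    # search.fit(X_train, y_train)  # X_train: (n_images, 954) feature matrix
    # print(search.best_params_, search.best_score_)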

4.1.3 Methods fusion

So far, the methodologies of the NFS and SFS systems are identical; they diverge only in the fusion methods applied thereafter. We expected that a fusion of the annotations derived from the three approaches could improve our initial results, as such fusion also led to improved final results in related works (see Chapter 2). This process also offers the possibility of assigning a wildcard to a particular position, given the IRMA code error count evaluation scheme in (2.1-3).


In this work we perform only pairwise fusions. In the NFS system the fusion strategy consists of majority voting for each axis independently; we call this normal fusion. If the position values coincide, the final code retains that value; otherwise a wildcard is placed (Example 4.1). Once a wildcard is placed, all subsequent positions are assigned wildcards as well.

1121-110-500-000 + 1121-120-421-000 → 1121-1**-***-000

Example 4.1 – IRMA code fusion by majority voting. Code 1121-120-421-000 is an example of a semantically meaningless code.
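A minimal sketch of this per-axis majority voting with wildcard propagation (illustrative helper names only):

    # Fuse two IRMA codes axis by axis; a disagreement at one position places
    # a wildcard there and at every subsequent position of that axis.
    def fuse_axis(code_a: str, code_b: str) -> str:
        fused, wildcard = [], False
        for ca, cb in zip(code_a, code_b):
            wildcard = wildcard or (ca != cb)
            fused.append("*" if wildcard else ca)
        return "".join(fused)

    def fuse_irma(irma_a: str, irma_b: str) -> str:
        # Codes written as four dash-separated axes, e.g. '1121-110-500-000'.
        return "-".join(fuse_axis(a, b) for a, b in
                        zip(irma_a.split("-"), irma_b.split("-")))

    print(fuse_irma("1121-110-500-000", "1121-120-421-000"))
    # -> 1121-1**-***-000, as in Example 4.1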

However, at this point we noticed that some codes assembled by the axis-wise and position-wise approaches have no representation among the 116 classes existent in the database. This happens when these approaches misclassify one or more axes, producing a semantically meaningless code. Such codes can be easily identified, and we know that they contain an error; the problem is that we do not know which axis or axes are wrong. In normal fusion we disregarded these codes, but we also experimented with bypassing them in the flat/axis-wise and flat/position-wise fusions: when such a code is detected, we assign the flat IRMA code annotation as the final annotation. An example of such a code can be seen in Example 4.1. Bypassing meaningless codes in the axis-wise/position-wise fusion was not taken into consideration because, on many occasions, the same image is annotated with a meaningless code by both approaches.

In a second fusion method we attempt to identify potentially wrongly classified codes in the flat annotation, in order to submit them to a normal fusion with the two other approaches. This method is based on the fact that the flat classification outperformed the two other approaches; this result will be presented in detail in Chapter 5, but it is given here in advance to provide a rationale for the smart fusion method in the SFS system (Figure 4.3). Thus, if a wrongly classified code in the flat method can be detected, we expect a gain from its fusion with the other methods whenever these provide a correct or partially correct annotation. If they provide an incorrect code different from the flat annotation, we can still reduce the error count; if the codes are identical, the error does not change. The only issue arises when false positives are detected. Here we can consider two situations: if a correct flat code is classified as incorrect, the axis-wise and position-wise annotations may also be correct, in which case normal fusion does not increase the error; otherwise, the error increases when we merge a false positive detected in the flat annotation with an incorrect code from another approach.


Figure 4.3 – Detailed flowchart for the SFS system. The methods fusion block comprises normal fusion (majority voting), with the flat approach considered our baseline.

To identify misclassified codes in the flat annotation we proceed, for each code, as follows (a sketch is given after the list):

1. Store the probability estimate of the annotated code as the first element of the feature vector.

2. Compute the average error distance between the annotated code and the k closest codes, understood as the k codes whose probability values are closest to that of the annotated code. Use this value as the second element of the feature vector.

3. Group the training examples that share the same classification as our test code.

4. With the training examples labeled as '0', if correct, and '1', if incorrect, train the classifier and predict the correctness of the code.
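A minimal sketch of steps 1-2 and of the second-stage classifier, where error_distance is a hypothetical stand-in for the IRMA error count between two codes:

    # Two-element feature vector for the misclassification detector: the top
    # probability and the average error distance to the k codes whose
    # probabilities are closest to the annotated one.
    import numpy as np
    from sklearn.svm import SVC

    def detector_features(probs, codes, error_distance, k=3):
        order = np.argsort(probs)[::-1]
        top, neighbors = order[0], order[1:k + 1]
        avg = np.mean([error_distance(codes[top], codes[j]) for j in neighbors])
        return [probs[top], avg]

    # With features F_train and 0/1 correctness labels y_train collected from
    # the training images sharing the same flat annotation (steps 3-4):
    # clf = SVC(kernel="rbf", gamma=0.9, C=128).fit(F_train, y_train)
    # flagged = clf.predict([detector_features(p, codes, error_distance)])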

We again used an SVM with an RBF kernel and performed a grid search to optimize its parameters using 11-fold cross-validation, each fold comprising one of the 11 disjoint sets previously created, and directly tested the normal fusion between the flat method and each of the other two approaches. Cross-validation is therefore scored by the minimum error count achieved. For γ we considered the values {0.1, 0.2, …, 0.9}, for a total of 9 values, and for the cost we considered C = 2^n, where n ∈ {0, 1, …, 7}. The best (γ, C) parameters were chosen for the minimum average error count attained over all disjoint training subsets, using k = 3 nearest neighbors. A total of 72 models was trained during the grid search. Weights were kept equal for both features.


Chapter 5

Results and Discussion

5.1 Feature extraction

The 954-element feature vector, originated by the concatenation of the image descriptor outputs, yielded zero values for 121 bins. This result was expected, since some image descriptors act in color spaces while the database consists of grayscale images. The Tamura textures descriptor resulted in zero values for contrast and coarseness in all images (probably caused by a software bug).

Changing the sensitivity of the box filters resulted in a more balanced number of interest points detected among all images; even at a very low sensitivity, the minimum number of points for an image was, in some cases, roughly 100. We also tried to equalize the image histograms to improve contrast, in order to extract more points in the difficult cases, but it proved ineffectual. Our strategy of using between 256 and 512 points for the visual word frequency histograms proved successful, since it originated sparse histograms as desired. However, some visual words are very common, resulting in high-frequency bins (Figure 5.1).

Figure 5.1 – Visual words frequency histogram using the SURF local descriptor.




We also performed some experiments with the SIFT descriptor, based on a Difference of Gaussians (DoG) interest point detector, without satisfactory results.

5.2 Annotation

Evaluation of the annotation results is based on the error count, according to (2.1-3), and on the error rate, i.e., the percentage of codes that have at least one error in one position within one code axis. The best RBF kernel parameterizations for the flat and axis-wise approaches resulted from the extensive grid search conducted. Overall results for this grid search are depicted in Figures 5.2-6, with the values of the γ and C kernel parameters transformed by a natural logarithm for better visualization. Highlighted values in all figures correspond to the parameters that yielded the highest accuracy rate; these are summarized in Table 5.1.

Figure 5.2 – Grid search results for the flat approach.

The grid search for the flat approach returned a value of γ at the limit of the range tested. Hoping for slightly better accuracy, we experimented with decreasing this value further.

[Figure 5.2 plot: accuracy (%) over Log(γ) × Log(C); highlighted optimum at Log(γ) = -9.21, Log(C) = 2.773, accuracy 88.84%.]


Unfortunately, the accuracy drops immediately. Such behavior can perhaps be explained by the decreasing slope in the lower left corner of the grid search surface, which reveals a downward accuracy trend for smaller values of γ.

Figure 5.3 – Grid search results for the axis-wise approach (Technique axis). [Plot: highlighted optimum at Log(γ) = -7.958, Log(C) = 1.386, accuracy 99.67%.]

Figure 5.4 – Grid search results for the axis-wise approach (Direction axis). [Plot: highlighted optimum at Log(γ) = -7.958, Log(C) = 1.386, accuracy 90.04%.]


Figure 5.5 – Grid search results for the axis-wise approach (Anatomy axis). [Plot: highlighted optimum at Log(γ) = -7.958, Log(C) = 1.386, accuracy 92.97%.]

Figure 5.6 – Grid search results for the axis-wise approach (Biology axis). [Plot: highlighted optimum at Log(γ) = -7.958, Log(C) = 1.386, accuracy 99.04%.]


From the grid search performed for the axis-wise approach, we realized that classification in

the Direction (D) and Anatomy (A) axes is more troublesome than in the Technique (T) and

Biology (B) axes.

                 Flat      Axis-wise
                           Technique  Direction  Anatomy  Biology
Gamma (γ)        0.0001    0.00035    0.00035    0.00035  0.00035
Cost (C)         16        4          4          4        4
Accuracy (%)     88.8      99.7       90.4       93.0     99.0

Table 5.1 – Best parameters for the RBF kernel according to the flat and axis-wise methods.

Before this grid search we wanted to verify whether the BoW model could provide some additional accuracy when concatenated with the remaining global image descriptors. Nowak [69] states that BoW models based on local image descriptors around interest points perform worse than those based on dense point sampling. Indeed, some tests using only our BoW model provided poor accuracy during cross-validation with an empirically configured RBF kernel. However, experiments using only the global image descriptors in a grid search did not achieve higher accuracies than those presented in Table 5.1. The decision to use the RBF kernel is grounded in the worse results achieved by other types of kernels, namely the linear, polynomial and sigmoid kernels, during preliminary experiments with the data.

The accuracy values in Table 5.1 do not tell us anything about the error count; there is no straightforward relationship between the two. Even if we misclassify few codes, they can be severely penalized according to the error evaluation scheme. Therefore, with the best parameters, we proceeded as described in Section 4.1.2 and evaluated the average error count in the training set considering 11-fold cross-validation. Recall that for the position-wise approach the RBF kernel parameterization used is identical to that of the axis-wise strategy. Considering the accuracies returned by the grid search, a better performance was expected for the flat method: even though the axis-wise strategy can outperform the flat annotation in terms of per-axis accuracy, it suffers from error propagation. Multiplying the per-axis probabilities, the overall accuracy for a completely correctly predicted code is 82.3%, less than the 88.8% attained by the flat annotation. We did not know, at this point, how the position-wise annotation would perform. Afterwards we annotated our test set as well. Results are shown in Table 5.2.


               Training Set                   Test Set
               Error count  Error rate (%)    Error count  Error rate (%)
Flat           30.8         11.9              31.4         13.3
Axis-wise      32.7         14.5              37.2         16.6
Position-wise  36.3         16.2              39.9         17.4

Table 5.2 – Error evaluation for all strategies considered.

The flat method eventually outperformed its counterparts both in error count and error rate, as also verified in several of the related works in Chapter 2. To analyze the differences between the classifiers, we computed the percentage of images correctly classified by all three methods, by two methods but not the third, by only one of the methods, and by none. Table 5.3 summarizes our findings for the second and third cases. Example images for the results in Table 5.3 are presented in Figure 5.7.

               Flat   Axis-wise  Position-wise
Flat           4.2%   3.1%       0.8%
Axis-wise      3.1%   0.3%       1.4%
Position-wise  0.8%   1.4%       1.8%

Table 5.3 – Percentage of test images correctly classified by only one method (diagonal cells) and by two methods but not the third (off-diagonal cells).

The percentage of images correctly classified by all three methods is 78.8%, while the percentage of images misclassified by all methods is 9.8%. From Table 5.3 it is clear that the flat annotation is more accurate where the other methods fail to classify correctly. The 9.8% of images misclassified by all methods defines a maximum theoretical limit of 90.2% on the accuracy of any possible methodology to further improve the initial results. We noticed some confusion between classes '1123-127-500-000' and '1123-120-500-000', responsible for roughly 5% of the misclassifications in all methods. These classes have a high inter-category


similarity; however, the D3 position is unspecified in the second case and may actually be identical to the first code (Figure 5.8).

Figure 5.7 – Example images: a) classified correctly by all approaches, b) misclassified by all approaches, c) classified correctly only by the flat method, d) classified correctly only by the axis-wise method, e) classified correctly only by the position-wise method, f) misclassified only by the flat method, g) misclassified only by the axis-wise method, h) misclassified only by the position-wise method.


Figure 5.8 – Examples from classes '1123-127-500-000' (left) and '1123-120-500-000' (right). Confusion between these two classes accounts for roughly 5% of the misclassifications in all approaches.

5.3 Semantically meaningless codes

Another result is that, when the IRMA code is assembled by the axis-wise and position-wise methods, the misclassification of a particular axis can produce a code that has no representation among the 116 possible classes. We named such a code a "semantically meaningless" code. In Figure 5.7 a), the classification returned meaningless IRMA codes for both axis-based approaches. The importance of these codes is that we are sure that they contain some error. However, experiments to find which axis or axes were misclassified produced no useful results: we attempted to use the minimum error distance between the meaningless codes and all 116 classes, weighted by class frequency in the training set, without success. We did find some correlation between the meaningless codes detected and misclassified codes in the flat method. The flat method does not suffer from this kind of misclassification, but it seems that images producing meaningless codes are also "hard" to classify with this strategy. For instance, 45.2% of the meaningless codes returned by the axis-wise annotation correspond to misclassifications by the flat annotation. By identifying this correspondence, the flat error rate and error count would decrease to 11.9% and 23.82, respectively. In the case of the position-wise method these values would decrease to 12.7% and 26.9, as only 25.6% of its meaningless codes correspond to a wrong classification in the flat annotation. Also, correcting these codes in both axis-based approaches would decrease the two error measures, as they correspond to roughly 4% of all annotations (Table 5.4). This relationship between meaningless codes and misclassifications in the flat approach is visible in the difference between the numbers of wildcards assigned under the two decision rules for majority voting.


5.4 Fusion

The NFS system fuses IRMA codes by majority voting between pairs of methods. Here we tested the decision rule of ignoring, or not, the meaningless codes. We also evaluated the number of wildcards generated. The results are presented in Table 5.4.

                           Normal Fusion                    Replace Meaningless
                           Error   Error      Wildcards     Error   Error      Nm    Wildcards
                           count   rate (%)   ("*")         count   rate (%)         ("*")
Train
  Flat/Axis-wise           26.9    16.1       -             28.3    14.6       36    -
  Flat/Position-wise       27.7    18.6       -             27.8    15.7       47    -
  Axis-wise/Position-wise  30.6    18.4       -             -       -          -     -
Test
  Flat/Axis-wise           29.1    18.3       352           30.1    16.0       42    219
  Flat/Position-wise       29.4    20.6       481           28.6    17.4       43    310
  Axis-wise/Position-wise  34.9    20.0       296           -       -          -     -

Table 5.4 – Results for the NFS system. Nm is the number of meaningless codes detected for the axis-based method involved.

In fact, we also tested fusion between all three methods, resulting in an error count of 33.7 and an error rate of 15.7%. This result is not unexpected, since all correct annotations by one method that were misclassified by the other two are lost (the sum of the diagonal cells in Table 5.3).

From Table 5.4 it can be seen that we did not perform normal fusion between the axis-based methods with meaningless code replacement. This decision is grounded in the fact that both methods share 22 images with meaningless codes, so when a meaningless code is detected there is a high probability of replacing it with another meaningless code. Also, normal fusion between these methods did not provide lower error measures than the flat annotation. Results in Table 5.4 show a decrease in the error count but an increase in the error rate. From this we can conclude that we lose accuracy at lower hierarchical positions but gain error count by assigning wildcards to wrong hierarchically superior positions. These results were consistent with our predictions.

The SVMs trained in this work make use of probability estimates, and in the NFS system this information was disregarded. Some tests training a probability threshold for wildcard assignment over all IRMA codes did not result in any significant gain: some images with a high probability output for a class, axis or position are incorrectly classified, while others, with a low probability output, were annotated correctly. Therefore, inferring a decision rule from the probability distribution alone did not work. First, let us look at the two-dimensional feature space generated by our pair of variables: probability and the average 3-NN distance (Figures 5.9-10).


Figure 5.9 – Feature spaces for two distinct IRMA codes. For a code with low representation in training (left) the separation between classes is better than for a highly represented code (right).

Figure 5.10 – Feature spaces for two distinct IRMA codes, one completely labeled as correct (left) and one labeled only as incorrect (right). There are classes of codes completely misclassified during the training phase.

In Figure 5.9 (left), for low representation, the two elements of the feature vector carry some information able to separate both classes, with the distribution of wrongly classified codes lying in a low-probability region; however, some correct codes also lie in this region. In Figure 5.9 (right) this is not very clear, with both classes mixed at high probabilities and different average distances. We trained a second SVM with an RBF kernel in a grid search to find the optimal parameters that return the lowest error count when performing fusion. Results are depicted in Figures 5.11-12.

[Figure 5.9 plots: smart fusion feature space (SVM probability output vs. average 3-NN distance) for IRMA codes 1121-230-961-700 (left) and 1123-110-500-000 (right).]

[Figure 5.10 plots: smart fusion feature space for IRMA code 1123-110-500-000, completely labeled as correct (left), and IRMA code 1121-240-438-700, labeled only as incorrect (right).]


Figure 5.11 – Grid search for the flat/axis-wise fusion. Best parameters are γ=0.9 and C=128.

Figure 5.12 – Grid search for the flat/position-wise fusion. Best parameters are γ=0.7 and

C=128.

[Figure 5.11 plot: error count over Log(γ) × Log(C); minimum error count 27.82 at Log(C) = 4.852, Log(γ) = -0.105.]

[Figure 5.12 plot: error count over Log(γ) × Log(C); minimum error count 27.34 at Log(C) = 4.852, Log(γ) = -0.357.]


The best parameterization was found at the extreme values tested, which would normally require a wider range of parameters to be tested. Even so, we used both configurations and tested their performance (Table 5.5).

                    Error count   Error rate (%)   Wildcards ("*")
Flat/Axis-wise      29.0          14.4             135
Flat/Position-wise  28.3          14.9             180

Table 5.5 – Results for the SFS system.

The strategy used in the SFS system resulted in a marginal improvement of the error count but a more significant improvement of the error rate when compared with the NFS system. A total of 70 codes are flagged as incorrect in the flat/axis-wise fusion; of these, only 43 correspond to true positives. In the case of the flat/position-wise fusion the number of incorrect codes detected is also 70, but with 50 true positives. While in both methods some false positives are merged, which may result in more error, the fusion of true positives yields a higher gain. The number of wildcards is smaller than in the NFS system: while fusion comprehends fewer codes, only 7% of the data, the discrepancy between the codes involved is higher.


Chapter 6

Conclusions and Future Work

In this work we addressed the medical image annotation problem and explored fusion strategies between all the methods involved. Standalone results show that annotation considering the conceptualization of the image as a single class, disregarding the IRMA code structure, works better than exploiting the nature of the code by separating it into its constituent axes or even positions. This result is in line with related works. Methods involving separate axis classification are prone to error propagation.

Benchmarking our results against related works on the same database places our SFS system close to the state of the art, with an error count of 26.8 (see Table 2.4 for more comparisons). However, our work involves different assumptions: the test set is not used to build the bag-of-words model, and the model is based on interest points instead of dense point sampling.

The image descriptors used during the several classification stages are the same; we only divide the feature space according to the concepts we want to annotate. The choice of the best image descriptors for each stage will be explored in future work, where it is also possible to add new image descriptors. Also, the weight of the elements in the image descriptor for SVM classification is the same. This is particularly important in the bag-of-words model, where all words have equal weight: common words, with high frequency, are not good discriminators between classes. Therefore, the application of term frequency - inverse document frequency (tf-idf) weighting to the histogram of visual words is also a possibility to test in future work.
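As a sketch of that idea only (assuming raw visual-word count histograms as input), tf-idf weighting would down-weight the ubiquitous words:

    # tf-idf weighting of visual-word histograms: words occurring in every
    # image receive weight zero, rare ones are emphasized.
    import numpy as np

    def tfidf(H):
        # H: (n_images, n_words) matrix of raw visual-word counts.
        tf = H / np.maximum(H.sum(axis=1, keepdims=True), 1)
        df = (H > 0).sum(axis=0)            # images containing each word
        idf = np.log(len(H) / np.maximum(df, 1))
        return tf * idf

    H = np.array([[5, 0, 1], [4, 1, 0], [6, 0, 0]], dtype=float)
    print(tfidf(H))  # the first word, present in all images, is zeroed out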

Our position-wise method underperformed when compared with the other two methods. Nevertheless, it showed interesting results in the fusions with the flat method. Error propagation is large in this method, but we are aware that the RBF kernel parameterization used is not optimal. Searching for an optimal parameterization for each stage seems cumbersome, but it can also be addressed in the future. It would also be interesting to develop a decision rule allowing the classifier to step back in the top-down tree classification when an error is likely.

Different classification strategies can also be applied. Aside from the flat and axis-wise methods, we could also group the database classes into larger concepts comprehending 2 or 3 axes. To avoid meaningless codes, a semantic reduction for the classification of the remaining grouped axes can take place. For the position-wise method we could also consider all the different top axis position



configurations as a class of objects. A high performance at this stage would lead to less error being committed during the annotation of subsequent positions.

The fusion schemas implemented led to better results but at a higher error rate. This means that, even if more errors are being counted due to the wildcard assignment, these are of lesser importance. While the NFS method considers a simple majority voting, the SFS method can be further explored. Detecting possible misclassifications is a complex problem, since the SVM output consists only of a label. Although the results show that a careful fusion is possible, there are still many aspects we would like to address carefully in the future, since the classification in this fusion method involves ordinal data, from the SVM probability output, and at the same time categorical data. We could also try to use as input features at this stage similarity measures between the classified class and the nearest classes, instead of the average error distance.

One of the most important findings was the meaningless codes. They are easy to find and do not require knowledge of the true image annotations. However, we could not implement any methodology to detect which axes were wrongly classified. The relationship between the meaningless codes and the flat method misclassifications can be extremely useful for developing a new fusion process. Moreover, meaningless codes reveal that the IRMA code is not axis-independent as stated. There are relationships between the IRMA code axes that can be the target of future work.

In the future we would also like to evaluate our methodologies on the more complex 2008 Medical Image Annotation task database, in order to test whether the conceptualization of the image content into smaller concepts pays off in the case of an unbalanced training/test example distribution. Future databases involving more image modalities would be interesting to work with as well. As a final remark, the methodologies presented here are not exclusive to medical image databases: they can be used with any database with a hierarchical annotation standard.


References

[1] E.A. Krupinski, “The importance of perception research in medical imaging”, Radiation

Medicine 18, (6), 2000.

[2] H.K. Huang, “PACS and imaging informatics: basic principles and applications”, John Wiley

& Sons Inc.: 25, 2004.

[3] H. Müller, N. Michoux, D. Bandon and A. Geissbuhler, “A review of content-based image

retrieval systems in medicine - clinical benefits and future directions,” International Journal

of Medical Informatics,vol. 73, no. 1, pp. 1–23, 2004.

[4] X. Zhou, A. Depeursinge and H. Müller, “Hierarchical classification using a frequency-based

weighting and simple visual features,” Pattern Recogn. Lett., vol. 29, no. 15, pp. 2011–

2017, 2008.

[5] http://archive.nlm.nih.gov/pubs/antani/icvgip02/icvgip02.php

[6] Y. Rui, T. Huang and S.F. Chang, “Image retrieval: Current techniques, promising directions

and open issues”, J. Visual Commun. Image Represent. 10, 1, 39–62, 1999.

[7] M. Kimura, M. Kuranishi, Y. Sukenobu, H. Watanabe, S. Tani, T. Sakusabe, T. Nakajima, S.

Morimura and S. Kabata, “JJ1017 committee report: image examination order codes –

standardized codes for image modality, region, and direction with local expansion: an

extension of DICOM”, Journal of Digital Imaging,15(2), 106-13,2002.

[8] B. Smith, “Beyond Concepts: Ontology as Reality Representation”, in Proceedings of the

International Conference on Formal Ontology and Information Systems, 2004.

[9] W.G. Stock and S. Schmidt, “Collective Indexing of Emotions in Images. A Study in

Emotional Information Retrieval”, Journal of the American Society for Information Science

and Technology 60(5), S. 863-876, 2009.

[10] P.B. Heidorn, “Image retrieval as linguistic and non-linguistic visual model matching”,

Library Trends, S. 309. 54, Chen & Rasmussen, 1999.

[11] W.D. Bidgood, “The SNOMED DICOM microglossary: controlled terminology resource for

data interchange in biomedical imaging”, Methods Inf. Med. 37, (4-5), 404-14, 1998.

[12] M.O. Güld, M. Kohnen, D. Keysers, H. Schubert, B. Wein, J. Brednoand and T. M. Lehmann,

“Quality of dicom header information for image categorization,” in Intl. Symposium on

Medical Imaging, ser. Proc. SPIE, vol. 4685, San Diego, CA, pp. 280–287, 2002.

[13] D.A. Forsyth, “Computer vision tools for finding images and video sequences”, Library

Trends, Vol. 48, No. 2, pp. 326-355, 1999.

[14] G. A. Seloff, “Automated access to the NASA-JSC image archives”, Library Trends, 38(4), 682-696, 1990.


[15] S.K. Chang and A. Hsu, “Image Information Systems: Where do we go from here?”, IEEE

Transactions on Knowledge and Data Engineering, 4, 431-442, 1992.

[16] E. Panofsky, “Meaning in the Visual Arts”, Doubleday Anchor Books, Garden City, NY,

1955.

[17] S. Shatford, “Analyzing the subject of a picture: a theoretical approach”, Cataloguing and Classification Quarterly, 6(3), 39–62, 1986.

[18] S. Shatford-Layne, “Some issues in the indexing of images”, Journal of the American

Society of Information Science, 45(8), 583-588, 1994.

[19] P.G.B. Enser, “Pictorial information retrieval”, Journal of Documentation, 51(2), 126-170,

1995.

[20] http://lmb.informatik.uni-freiburg.de/research/completed_projects/isearch/research.en.html

[21] P. Aigrain, “Organizing Image Banks for Visual Access: Model and Techniques”, OPTICA’87

Conf. Proc., pp.257-270, Amsterdam, 1987.

[22] T. Kato, T. Kurita, N. Otsu and K. Hirata “A Sketch Retrieval Method for Full Color Image

Database – Query by visual example”, Proc. ICPR, Computer Vision and Applications,

pp.530-533, 1992.

[23] J. Eakins and M. Graham, “Content based image retrieval”, JISC Technology Applications

Program, Report 39, 1999.

[24] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D.

Lee, D. Petkovic and D. Steele, “Query by image and video content: The QBIC system”,

Computer, 28(9), 23-32, 1996.

[25] P. Aigrain, H. Zhang and D. Petkovic, “Content-Based Representation and Retrieval of

Visual Media: A State of the Art Review”, Multimedia Tools and Applications, Vol. 3, No. 3.

pp. 179-202, 1996.

[26] A. Smeulders, M. Worring, S. Santini, A. Gupta and R. Jain, “Content Based Image Retrieval

at the End of the Early Years”, IEEE Transactions on Pattern Analysis and Machine

Intelligence, Vol. 22, No. 12, 2000.

[27] R. Datta, D. Joshi, J. Li and J. Z. Wang, “Image retrieval: Ideas, influences, and trends of the new age”, ACM Computing Surveys, (39), 2007.

[28] C. Harris and M.J. Stephens, ”A combined corner and edge detector”, In Alvey Vision

Conference, pp 147–152, 1988.

[29] T. Lindeberg, “Detecting salient blob-like image structures and their scales with a scale-

space primal sketch: a method for focus of attention”, International Journal of Computer

Vision 11 (3): pp 283–318, 1993.

[30] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, International

Journal of Computer Vision, 60, 2, pp. 91-110, 2004.


[31] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors", IEEE

Transactions on Pattern Analysis and Machine Intelligence, 10, 27, pp 1615--1630, 2005.

[32] L. Fei-Fei, R. Fergus, and P. Perona, “A Bayesian Approach to Unsupervised One-Shot

Learning of Object Categories,” Proc. IEEE Int. Conf. Computer Vision, 2003.

[33] S. Sedghi, M. Sanderson and P. Clough, “A study on the relevance criteria for medical

images”, Pattern Recognition Letters, 29, pp. 2046-2057, 2008.

[34] A. Cawkell, “Indexing collections of electronic images: A review”, British Library Research

Review, 15, 1993.

[35] P.G.B. Enser, “Towards a comprehensive review of the semantic gap in visual image

retrieval”, Lecture Notes on Computer Science, vol. 2728/2003, 163-168, 2003.

[36] L. R. Long, S. Antani, T. M. Deserno, and G. R. Thoma, “Content based image retrieval in

medicine: retrospective assessment, state of the art, and future directions”, Int J Healthc

Inf Syst Inform, vol. 4, no. 1, pp. 1–16, 2009.

[37] Z. Xue, L.R. Long, S. Antani, J. Jeronimo and G.R. Thoma, “A Web accessible content-based

cervicographic image retrieval system”, in Proceedings of the SPIE Medical Imaging, 6919,

2008.

[38] W. Hsu, S. Antani and L.R. Long, “SPIRS: A Framework for Content-based Image Retrieval

from Large Biomedical Databases”, in Proceedings of the MEDINFO, 12(1), 188-92.

[39] T.M. Deserno, M.O. Güld, B. Plodowski, K. Spitzer, B.B. Wein, H. Schubert,H. Ney and T.

Seidl, “Extended query refinement for medical image retrieval”, Journal of Digital

Imaging; online-first, DOI 10.1007/s10278-007-9037-4, 2007.

[40] S. Antani, T.M. Deserno, L.R. Long, M.O. Güld, L. Neve and G.R. Thoma, “Interfacing global

and local CBIR systems for medical image retrieval”, In Proceedings of the Workshop on

Medical Imaging Research (Bildverarbeitung fur die Medizin), 166-71, 2007.

[41] T.M. Lehmann, M.O. Güld, C. Thies, B. Fisher, K. Spitzer, D. Keysers et al., “Content Based Image Retrieval in Medical Applications”, Methods of Information in Medicine, 43, 354-61, 2004.

[42] H. Müller, N. Michaux, D. Bandon and A. Geissbuhler, “A review of content-based image

retrieval systems in medical applications: Clinical benefits and future directions”, 2007.

[43] C. Akgül, D. Rubin, S. Napel, C. Beaulieu, H. Greenspan and B. Acar, “Content Based Image

Retrieval in Radiology: Current Status and Future Directions”, Journal of Digital Imaging,

[Epub ahead of print], 2010

[44] T. M. Lehmann, H. Schubert, D. Keysers, M. Kohnen and B. B. Wein, “The IRMA code for unique classification of medical images,” in Medical Imaging, vol. 5033 of SPIE Proceedings, pp. 109–117, 2003.

[45] T. Tommasi, B. Caputo, P. Welter, M. O. Güld and T. M. Deserno, “Overview of the clef

2009 medical image annotation track,” in Proceedings of the 9th CLEF workshop 2009, ser.

Lecture Notes in Computer Science (LNCS), Corfu, Greece, September 2009.


[46] “Tutorial on Medical Image Retrieval - IRMA”, Medical Informatics Europe, 2005.

[47] P. Clough, H. Müller, T. Deselaers, M. Grubinger, T. M. Lehmann, J. Jensen and W. Hersh, “The CLEF 2005 cross-language image retrieval track,” in Working Notes of the 2005 CLEF Workshop, Vienna, Austria, 2005.

[48] H. Müller, T. Deselaers, T. M. Lehmann, P. Clough, E. Kim and W. Hersh, “Overview of the

ImageCLEFmed 2006 medical retrieval and medical annotation tasks,” in CLEF 2006

Proceedings, ser. Lecture Notes in Computer Science (LNCS), vol. 4730. Alicante, Spain:

Springer, 2007, pp. 595–608.

[49] T. Deselaers, T. M. Deserno and H. Müller, “Automatic medical image annotation in

ImageCLEF 2007: Overview, results, and discussion”, Pattern Recognition Letters, vol. 29,

no. 15, pp. 1988–1995, 2008.

[50] T. Deselaers and T. Deserno, “Medical image annotation in imageclef 2008,” in CLEF

Workshop 2008: Evaluating Systems for Multilingual and Multimodal Information Access,

Aarhus, Denmark, September, 2009.

[51] J.E.E. Oliveira, A.P.B. Lopes, G. Camara-Chavez, A. de Araujo and T.M. Deserno, “MammoSVD: A content-based image retrieval system using a reference database of mammographies”, in 22nd IEEE International Symposium on Computer-Based Medical Systems (CBMS 2009), 2009.

[52] H. Pourghassem and H. Ghassemian, “Content-based medical image classification using a

new hierarchical merging scheme”, Comput Med Imaging Graph 2008; (Draft), 2008.

[53] E. Dougherty, “Electronic Imaging Technology”, Technology & Engineers, 1999.

[54] http://plato.stanford.edu/entries/color/

[55] http://www.colour.org

[56] J. Coggins, “A Framework for Texture Analysis Based on Spatial Filtering,” Ph.D. Thesis,

Computer Science Department, Michigan State University, East Lansing, Michigan,1982.

[57] M. Tuceyrn and A. Jain, “Texture Analysis”, The Handbook of Pattern Recognition and

Computer Vision (2nd Edition), pp. 207-248, World Scientific Publishing Co., 1998.

[58] R.M. Haralick, K. Shanmugam and I. Dinstein, “Textural features for image classification,”

IEEE Transactions on Systems, Man, and Cybernetics, SMC-3, pp. 610-621, 1973.

[59] T. Sikora, “The MPEG-7 visual standard for content description - an overview,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 696–702, June 2001.

[60] H. Tamura, S. Mori and T. Yamawaki, “Textural Features Corresponding to Visual Perception,” IEEE Transactions on Systems, Man, and Cybernetics, SMC-8, pp. 460-473, 1978.


[61] D. Park, Y. Jeon and C. Won, “Efficient use of local edge histogram descriptors”,

Proceeding of the 2000 ACM workshops on Multimedia, 51-54, 2000.

[62] S. A. Chatzichristofis and Y. S. Boutalis, “CEDD: Color and edge directivity descriptor: A

compact descriptor for image indexing and retrieval.” in ICVS, ser. Lecture Notes in

Computer Science, A. Gasteratos,M. Vincze, and J. K. Tsotsos, Eds., vol. 5008. Springer, pp.

312–322, 2008.

[63] S. Chatzichristofis and Y. Boutalis, “FCTH: Fuzzy color and texture histogram-a low level

feature for accurate image retrieval,” in Proceedings of the 9th International Workshop on

Image Analysis for Multimedia Interactive Services, WIAMIS, pp. 191–196, 2008.

[64] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of

the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–

175, 2001.

[65] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,”

Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.

[66] B. Lovell and C. Walder, “Support Vector Machines for Business Applications”, in Business Applications and Computational Intelligence, Idea Group Publishing, 2006.

[67] V. Vapnik, “The Nature of Statistical Learning Theory”. New York: Springer, 1995.

[68] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines”, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[69] E. Nowak, F. Jurie and B. Triggs, “Sampling strategies for bag-of-features image classification”, in Proc. European Conference on Computer Vision, vol. 4, pp. 490–503, 2006.