
Draft (for Attention and Performance XVI): July 7, 1994. Please do not quote

An Architecture for Rapid, Hierarchical Structural Description

John E. Hummel and Brian J. Stankiewicz

University of California, Los Angeles

Running Head: Structural Description

Address correspondence to:

John E. Hummel
UCLA, Department of Psychology

405 Hilgard Ave.

Los Angeles, CA 90024-1563

 [email protected]


Abstract

Dynamic binding is a necessary prerequisite to structural description. Synchrony of firing has been proposed as one means for dynamic binding in neural systems, but it is not straightforward to represent structural descriptions suitable for object recognition given the limitations of synchrony for binding (specifically, limitations of speed and reliability). This paper presents a model of object recognition addressed to this problem. The model generates hierarchical representations that function as structural descriptions in the event of correct dynamic binding, but still support object recognition when dynamic binding is in error. The model can recognize objects rapidly (i.e., before dynamic binding can be established), but has the properties of a structural description once dynamic binding is established correctly (e.g., it can recognize objects in novel viewpoints). The model's operation and performance, and some implications of this approach to object representation, are discussed.


Introduction

Among students of human object perception, it is relatively uncontroversial to posit that object recognition is mediated, at least in part, by the activation of a structural description (Clowes, 1967; Palmer, 1975; Sutherland, 1968; Winston, 1975), a representation specifying an object's features or parts and their interrelations. Variants of this idea have been proposed to account for many aspects of human object perception, including our ability to recognize objects despite variations in viewpoint (e.g., Biederman, 1987; Marr & Nishihara, 1978), limitations on our capacity for viewpoint-invariant recognition (e.g., Biederman & Gerhardstein, 1994; Hummel & Biederman, 1992), the manner in which we segment objects into parts (e.g., Biederman & Cooper, 1991; Palmer, 1977), and a variety of other findings (see Barsalou, 1993; Cooper, Biederman & Hummel, 1992; Quinlan, 1991).

A structural description is a binding of object primitives (parts or features) into proposition-like structures specifying their interrelations. These primitive-relation bindings must be created dynamically, by binding together independent representational units, rather than statically, by pre-dedicating a separate unit for each conjunction. That is, rather than having a separate unit for every possible part in every possible relation, it is necessary to have units for parts that are independent of the units for their interrelations, and to represent part-relation conjunctions by dynamically binding part units to relation units (see Hummel & Biederman, 1992). Dynamic binding is necessary because the number of units required to statically bind parts to all possible interrelations would grow exponentially with the number of relations, and, more importantly, because static binding sacrifices the attribute structure of the represented entities: The fact that a cylinder above something is more similar to a cylinder below something than to a slab below something is lost in a representation where each part-relation binding is coded by a separate unit (see von der Malsburg, 1981). This loss of attribute structure is a fundamental property of static binding that cannot be overcome even with sophisticated static codes, such as Smolensky's (1990) tensor products (see Hummel & Biederman, 1992; Hummel & Holyoak, 1993). Dynamic binding is thus a prerequisite to structural description.

The dynamic binding problem poses one of the great challenges to understanding how the visual system might generate structural descriptions because it is not obvious how otherwise independent neurons could dynamically bind themselves into groups. Recently, there has been substantial interest in the idea that synchrony of firing may provide one mechanism for dynamic binding in neural systems (e.g., Eckhorn, Bauer, Jordan, Brosch, Kruse, Munk, & Reitboeck, 1988; Gray & Singer, 1989; Gray, König, & Singer, 1989). The basic idea is that neurons may fire in synchrony when they represent elements of the same group, and out of synchrony when they represent elements of different groups. This idea has been used extensively in models of sensory segmentation (e.g., Eckhorn, Reitboeck, Arndt & Dicke, 1990; Wang, Buhmann & von der Malsburg, 1991), and Hummel and Biederman (1990, 1992) proposed a model that uses synchrony for structural description and object recognition. This model (called JIM) illustrates the properties of structural descriptions based on dynamic binding. Like human object recognition, recognition in JIM is invariant with translation, scale, left-right reflection, and, to a point, rotation in depth. These invariances are a result of the fact that the model


represents attributes of an object's shape (specifically, attributes of its geons [Biederman, 1987] and their interrelations) and pose (the geons' positions and sizes in the image) independently: shape is represented on one set of units, and pose is represented on another, so changes in the latter do not affect the representation of the former. In turn, this independence is possible only because the shape units can be bound to pose units dynamically. JIM's capacity for viewpoint invariance is thus a consequence of its capacity for dynamic binding1 (see Hummel & Biederman, 1992).

Although this work is suggestive, we are still far from understanding how the visual system might use synchrony for dynamic binding in structural description. Hummel and Biederman's model illustrates some advantages of dynamic binding, but it requires synchrony to be established very rapidly and without errors (the same is true of other models that use synchrony to represent complex structures, e.g., Hummel & Holyoak, 1993; Shastri & Ajjanagadde, 1993). By contrast, models of sensory segmentation are often based on more realistic assumptions about the neural mechanisms of synchrony, but these models do not address how the resulting synchrony can be used to represent complex structures (beyond simply noting that it provides a solution to image segmentation). The connection between models that establish synchrony under realistic assumptions and models that use synchrony to represent complex structures is not straightforward, because the synchrony produced by realistic models does not conform to the assumptions about synchrony on which models such as Hummel and Biederman's are based. Thus, it remains unclear how synchrony can be used to represent complex structures under realistic assumptions about the manner in which it is established. This issue is important because the limitations of dynamic binding necessarily constrain the use of those bindings to represent complex structures. These constraints pose a substantial challenge to understanding how object recognition might realistically be accomplished by structural description.

Constraints on Dynamic Binding for Structural Description

Time: In contrast to static binding, dynamic binding is a process, and therefore takes time. For example, in the case of synchrony, some process must actively get units firing in and out of synchrony with one another. Moreover, in the case of synchrony, time is required not only to establish the bindings, but also to represent them once they are established: If separate groups of units are to fire out of synchrony with one another, then their firing must be distributed over time. In the context of object recognition, the time required for dynamic binding is problematic. Face recognition in the macaque is accomplished to a high degree of certainty based on the information in the first set of spikes to reach inferior temporal cortex (Oram & Perrett, 1992; Tovee, Rolls, Treves, & Bellis, 1992). Clearly, the macaque visual system recognizes faces without waiting around for several sets of desynchronized spikes. To the extent that face recognition differs from the recognition of non-face objects (see, e.g., Farah, 1992), the implications of these findings for structural description are only indirect. More direct evidence comes

1 The viewpoint-sensitivities evidenced by this model, such as sensitivity to rotation in the picture plane, reflect the primitives (i.e., the specific vocabulary of geon attributes and relations) on which recognition is based, not the manner in which those primitives are bound into structures.


from experiments in which human subjects viewed pictures of common (non-face) objects. Using Rapid Serial Visual Presentation (RSVP), Intraub (1981) found that human subjects could recognize objects presented at the rate of ten per second. Even if we assume that object recognition is "pipelined", so that the early stages are performed on one image while later stages are performed on others, this finding suggests that no stage of object recognition can require more than about 100 ms. Similarly, Cooper and Biederman (in preparation) found that recognition was reliably better than chance when line drawings were presented for only 28 ms and followed by a patterned mask. Although these findings do not obviate time-consuming processes such as synchrony, they place strong constraints on just how time-consuming such processes can be.

The temporal requirements of dynamic binding are a special case of a general problem confronting all models of object recognition. All models posit processes that map early visual representations into representations suitable for object recognition. Usually, the processes serve to discount image variations due to viewpoint. In structural description models (e.g., Bergevin & Levine, 1993; Dickinson, Pentland & Rosenfeld, 1992; Hummel & Biederman, 1992), the processes generate structural descriptions from 2D images. In other models, the processes perform transformations that place viewed images into a common frame of reference with stored object representations so the two may be directly compared (e.g., Lowe, 1987; Poggio & Edelman, 1990; Siebert & Waxman, 1992; Ullman, 1989; Ullman & Basri, 1991). But in every case, the processes in question take time2. The question of how we recognize objects both rapidly and despite variations in viewpoint is thus an important and general one.

Errors: A second constraint on the use of dynamic binding is the possibility of errors or noise. In the case of synchrony, errors are inevitable: The synchrony that has been observed in real nervous systems is by no means tidy (see, e.g., Gray et al., 1989). Errors of dynamic binding can cause serious problems in the generation and representation of a structural description. If dynamic binding is used for image segmentation, binding local image features into larger parts such as geons, then errors of dynamic binding are errors of image segmentation. This is problematic because structural descriptions are exquisitely sensitive to the manner in which an object is segmented into parts (see Ullman, 1989). Equally important is the effect of binding errors on an independent attribute code. If independent part and relation units are bound into structural descriptions dynamically, then errors of dynamic binding are errors of structural description: What would be a structural description given correct dynamic binding degenerates into a simple feature list when dynamic binding fails. In the case of Hummel and Biederman's JIM, these errors are not subtle. For example, without correct dynamic binding JIM cannot distinguish a cone on top of a brick from a brick on top of a cone, or from a cylinder on top of a wedge (see Hummel & Biederman, 1992).

2 Some models take less time than others, but those models that provide a more complete account of the human capacity for viewpoint invariance typically employ more time-consuming processes (see Hummel, 1994).


Summary and General Approach 

Dynamic binding makes it possible to group otherwise independent object attributes (e.g., local image features, parts, and relations). This capacity is fundamental to structural description, and affords viewpoint-invariance as a natural consequence. However, the disadvantage of an independent attribute code is that it carries no binding information at all when dynamic binding fails. Inasmuch as dynamic binding takes time to establish, such failures are especially likely in the early stages of generating a structural description from an object's image. Given this, how can a visual system recognize objects rapidly and generate reliable structural descriptions? We present a model of object recognition motivated by this question. Starting with the contours in an object's image, the model generates a structural description specifying the object's geons and their interrelations, and uses that description as a basis for object recognition. The model uses synchrony to dynamically bind local image features into geons (and geons to relations), and static binding to specify the composition (or sub-structure) of each dynamically-defined group. The resulting representation is a hybrid of a structural description and a view (see Poggio & Edelman, 1990; Ullman & Basri, 1991) that specifies each group both in terms of its independent shape attributes and its hierarchical composition. This hybrid has the advantages of a structural description in the event of correct dynamic binding, but does not degenerate into a feature list when dynamic binding fails. As a result, the model can recognize objects both rapidly (even before synchrony is established) and in novel viewpoints (once synchrony is established).

The Model

The model consists of six layers of units (Figure 1). As input, it takes the contours in a line drawing of an object, and as output, it activates a single unit corresponding to the identity of the object. Units in the input layer (Layer 1) represent image contours. Layer 2 units represent vertices formed by coterminating contours, and axes of symmetry between pairs of contours. Contour and vertex units interact to synchronize their outputs given evidence that they belong to the same geon, and desynchronize their outputs given evidence that they belong to different geons. When synchrony is established correctly, the output of Layer 2 at any given time will represent the 2D features of a single geon; but before synchrony is established, or when there is an error in the synchrony, the output of Layer 2 may represent the features of more than one geon, or even the whole object.

The model's third layer consists of modules of units that interpret the output of Layer 2 (i.e., the current part3) and impose their interpretation on the units in Layer 4. This interpretation has two components. The first, the Independent Geon Array (IGA), specifies the part's shape attributes and relations to other parts (Figure 1). The units in the IGA are completely independent of one another, so they preserve the part's attribute

3 We will use the term "part" to refer to a collection of features, e.g., an instantaneous pattern of output from Layer 2, and the term "geon" in Biederman's (1987) sense, i.e., to refer to volumes composed of specific features in a specific configuration. "Part" will also be used to refer to the volumetric interpretation of a set of features.


structure, but are sensitive to errors of dynamic binding. The second component, the substructure matrix (SSM), specifies the part's sub-structure. Each unit in the SSM responds to a particular shape attribute (the same vocabulary of attributes coded in the IGA) at one of 17 positions relative to the part as a whole. That is, each unit codes a static binding of one shape attribute to one position in a part-centered reference frame.

When the output of Layer 2 corresponds to the features of a single geon, the pattern of activation produced in the IGA will represent a dynamic binding of its shape attributes and relations, and the pattern on the substructure matrix will represent its substructure. The series of such patterns produced in response to each of an object's geons constitutes a structural description of the object. When the binding is incorrect (i.e., when two or more geons fire in synchrony), the pattern produced on the IGA will incorrectly blend the attributes of separate geons into a single representation. However, the substructure matrix will preserve the separation of the geons' attributes because the geons will be represented on separate parts of the matrix. In this case, the pattern on the substructure matrix will be a view-like representation that, albeit sensitive to the viewpoint in which the object is depicted, is sufficient to serve as a basis for object recognition provided the view is familiar. In general, Layers 3 and 4 generate a representation that independently codes part-relation conjunctions with dynamic binding (the IGA), and statically binds geon attributes to part-centered coordinates (the SSM). This hybrid of dynamic and static binding frees the model from extreme sensitivity to errors of dynamic binding, but retains the advantages of dynamic binding in the absence of such errors.


Figure 1. The model's architecture. The model is a six-layer neural network. Units in Layer 1 represent the contours in an object's image (heavy solid lines in the plane labeled "Layer 1"). Units in Layer 2 respond to vertices formed where contours coterminate (heavy solid lines; Layer 2) and axes of parallel and non-parallel symmetry between pairs of contours (broken lines; Layer 2). Layer 3 consists of modules of units (cylinders; Layer 3) with finite circular receptive fields (light ellipses over Layers 1 and 2). Modules have three receptive field sizes distributed in a hexagonal lattice over the model's visual field. Layer 4 consists of two parts: The Independent Geon Array (IGA) contains 11 units that code the shape attributes of a geon (or part) and 5 units that code its categorical relations to other parts; the Substructure Matrix (SSM) consists of arrays of units distributed at each of 17 positions in a circular coordinate system. Each array contains 11 units that code geon shape attributes. Layers 3 and 4 generate the hybrid structural description/view representation on which object recognition is based. Units in Layer 5 respond to specific Geon Feature Assemblies, patterns of activation over the IGA and SSM. Units in Layer 6 sum the outputs of Layer 5 units over time to respond to complete objects.

Layers 5 and 6 use the output of Layer 4 as a basis for object recognition. Units in Layer 5 are trained to respond to specific patterns of activation on the units in Layer 4;


roughly, each unit responds to a particular part in a particular set of relations, with a particular substructure. Following Hummel and Biederman (1992), these units will be referred to as geon feature assembly units, or GFA units, and the patterns to which they respond will be referred to as GFAs. Each unit in Layer 6 takes its input from a collection of GFA units, and responds to a complete object.

The theoretical focus of this work is the architecture of Layers 3 and 4. The remainder of the model is simplified, and is intended only to serve the practical functions necessary for exploring the architecture's properties. Layers 1 and 2 provide Layer 3 with the raw material from which to generate a structural description, but the specific vocabulary of features is theoretically unimportant. Similarly, Layers 5 and 6 use the output of Layer 4 as a basis for object classification, but the details of their operation are not intended as theoretical claims.

Layers 1 and 2: Contours, Vertices, and Axes 

Units in the first two layers represent the contours, vertices, and axes in an object's image, and group themselves into sets corresponding to geons by synchronizing their outputs. The operation of Layers 1 and 2 is taken from Hummel (1994), and is described here only briefly. Each unit in Layer 1 represents a single contour in terms of its location (in the model's 140 X 140 pixel visual field), its orientation, and whether it is straight or curved. Vertices (Layer 2) are defined where the endpoints of separate contours meet. Each vertex is specified in terms of its location, the orientations and curvatures of its legs, and its class. There are four vertex classes: Any vertex with two legs is an L; a vertex with three legs such that one angle between an adjacent pair of legs is greater than 180° is an arrow; a fork has three legs with no angles greater than 180°; and any vertex with only one leg is a T. Axes of symmetry (Layer 2) are defined between all pairs of contours. Each axis is defined in terms of its location, orientation, curvature, length, width, and an index of its expansion (axes of parallelism have an expansion near zero, and axes of non-parallel symmetry have an expansion greater than zero). The model detects both vertices and axes automatically from the contours in an object's image.
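The four-class vertex taxonomy reduces to a few lines of code. The sketch below is a hypothetical helper, not the model's actual implementation; it assumes the angles between adjacent legs are given in degrees:

```python
def classify_vertex(num_legs, adjacent_angles):
    """Classify a vertex into one of the model's four classes.

    num_legs: number of contour legs meeting at the vertex.
    adjacent_angles: angles (degrees) between adjacent pairs of legs;
        for a three-legged vertex, the three angles sum to 360.
    """
    if num_legs == 1:
        return "T"      # a contour terminating against another contour
    if num_legs == 2:
        return "L"
    if num_legs == 3:
        # arrow: one angle between adjacent legs exceeds 180 degrees;
        # fork: no angle exceeds 180 degrees
        return "arrow" if max(adjacent_angles) > 180 else "fork"
    raise ValueError("vertex classes are defined for 1-3 legs only")
```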

Grouping: Units update their states in discrete time steps, t. Contour, vertex, and axis units interact to synchronize their outputs (generate non-zero outputs at the same time) if they belong to the same geon, and desynchronize their outputs if they belong to different geons. The state of each unit at time t is described by an input, an activation, an output, and a gate. An image is presented to the model by setting the inputs to the appropriate contour units, and vertex and axis units determine their activations on the basis of their inputs from contours. A unit's output is given by its activation multiplied by its gate. A unit's gate varies cyclically, so its output cycles between zero and a value close to its activation. Units synchronize their outputs by equating their gates. Following von der Malsburg & Buhmann (1992), units desynchronize their outputs by means of a global inhibitor, which, in the absence of signals to equate their gates, tends to push their gates in opposite directions. Asynchrony is thus the default, and units will only synchronize their outputs by actively exchanging signals. Contour and vertex units interact to synchronize their outputs if they belong to the same geon, and the global


inhibitor operates to keep separate geons out of synchrony with one another. Axis units generate non-zero outputs only when they receive input from both of the contour units to which they are connected. As a result, a geon's axes fire in synchrony with its contours and vertices, and axes formed between the contours of separate geons will not fire at all (except when the geons fire in synchrony with one another).
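The gate-and-inhibitor scheme can be caricatured in a few lines. This is only a qualitative sketch under simplifying assumptions: phases are averaged to "equate gates", and a small random phase jitter stands in for the global inhibitor that makes asynchrony the default.

```python
import random

class GatedUnit:
    """A unit whose output is its activation multiplied by a cyclic gate."""
    def __init__(self, activation):
        self.activation = activation
        self.phase = random.random()  # position in the gate cycle, in [0, 1)

    def output(self):
        # The gate is open only during a brief window of each cycle, so
        # the output cycles between zero and (nearly) the activation.
        gate = 1.0 if self.phase < 0.2 else 0.0
        return self.activation * gate

def step(units, coupled_pairs, dt=0.05, jitter=0.02):
    """Advance every gate one time step. Units in coupled_pairs equate
    their gates (synchrony); the random jitter drives uncoupled gates
    apart over time, so asynchrony is the default."""
    for u in units:
        u.phase = (u.phase + dt + random.uniform(0.0, jitter)) % 1.0
    for i, j in coupled_pairs:
        shared = (units[i].phase + units[j].phase) / 2.0
        units[i].phase = units[j].phase = shared
```

Iterating step over units whose couplings follow the contour-vertex interactions described above yields the qualitative behavior shown in Figure 2: coupled units come to fire in the same window while uncoupled units drift apart.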

Only a few properties of the model's synchrony algorithm are important for our current purposes. First, in contrast to the algorithm proposed by Hummel and Biederman (1990, 1992), it does not require the connections that mediate synchrony to operate any faster than those that mediate the exchange of excitation and inhibition. Rather, the speed of the synchrony connections relative to "ordinary" connections is a parameter that is free to vary. The behavior of the algorithm is largely the same under all values of this parameter from one (where synchrony connections are the same speed as ordinary connections) to eight (it has not been tested with values larger than eight). In the simulations reported here, this parameter is set to three: Units update their gates three times each time they update their activation and output once. The most notable property of the grouping algorithm is that it segments objects into parts at matched concavities, exploiting Hoffman & Richards' (1985) transversality regularity. But rather than having to be "told" where the object's interior lies (Hoffman & Richards, 1985, simply assume this is known), the algorithm uses local interactions between contour and vertex units to discover which contour cotermination points are convex (i.e., vertices) and which are concave (i.e., cusps). Contours excite vertices and inhibit cusps. By synchronizing contours with the vertices to which they are connected and desynchronizing everything else, the model segments an object into parts at matched cusps.

Figure 2 shows the output of this algorithm on a representative run. As input, the model was given the eight contours in the simple image in the Key. Figure 2 plots the output of each contour unit as a function of time. Broken curves in the graph show the outputs of the contour units depicted as broken lines in the Key, and solid curves show the outputs of the contours depicted as solid lines. Note that initially (i.e., at t = 1) all eight contours generated outputs close to 1.0; initially, all units "fired" in synchrony with all others. The desired output pattern is to have the broken contours fire in synchrony with one another and out of synchrony with the solid contours. By t = 37, the contours started to produce the desired pattern, which persisted throughout the remainder of the run.

Figure 2 illustrates two important properties of the model's grouping algorithm. First, all the units started in synchrony with one another, but eventually grouped themselves as desired. There is a stochastic component to the units' gates, so the number of iterations it takes to establish the correct groupings varies. On this run, it took about 40 iterations, which is representative of the model's average performance. Second, note that the synchrony produced is somewhat noisy. Even when the units fire in synchrony with one another, they rarely fire in perfect synchrony. One property of the algorithm that does not appear in the figure is that, even after the units get properly synchronized and desynchronized, separate groups may temporarily drift back into "accidental" synchrony with one another. The likelihood that two or more parts will drift into accidental synchrony grows with the number of parts.


Figure 2. The behavior of the grouping algorithm on one run given the contours depicted in the key as input. Broken lines in the figure show the output of contour units depicted as broken lines in the key; solid lines in the figure correspond to solid contours in the key.

Summary of Layers 1 and 2: Units in the first two layers represent the contours, vertices, and 2D axes in a line drawing of an object, and bind them into sets corresponding to geons. Binding is represented as synchrony established via local interactions between contour and vertex units. When an image is first presented to the model, all contours -- along with the vertices and axes they activate -- will tend to fire together whether they belong to the same geon or not. Eventually, the units synchronize themselves into groups corresponding to geons. But even then, the synchrony is error-prone, and separate geons will often fire in synchrony with one another, especially if the object has many geons. As such, there is no guarantee that the output of Layers 1 and 2 at any given time will correspond to a single geon. The model's third layer operates to generate a useful representation of the part given by Layers 1 and 2, whether it corresponds to a single geon, the whole object, or something in between.


Layers 3 and 4: Modules, and Geons, Relations, and Sub-Geons 

The model's third layer characterizes the object part represented by the current output of Layer 2, and imposes that characterization on Layer 4. The goal of these operations is to generate a representation of the part's shape (on the assumption that it is a single geon), its relations to the object's other parts, and its substructure. To this end, Layer 3 is organized into modules of units with finite, circular receptive fields. The units within a module characterize the shape attributes of the geon(s) in the module's receptive field, and separate modules interact to characterize the structure and substructure of whatever object part they are collectively given as input. Module receptive fields of three sizes are distributed in a hexagonal lattice over the 140 X 140 pixel visual field (see Figure 1). There are 2116 small modules (whose receptive fields have a radius of 14 pixels, and the centers of adjacent small receptive fields are separated by 3.50 pixels horizontally and 3.03 pixels vertically), 460 medium modules (radius 28; 7.00 by 6.06), and 110 large modules (radius 56; 14 by 12.12).

Part Representation: Based on the instantaneous outputs of vertex and axis units, each module characterizes the geon or geons within its receptive field. Each module contains eleven units that respond independently to the various attributes of a geon's shape. Two units code the shape of the geon's cross section (one for straight and one for curved); two code the shape of its axis (straight or curved); three code whether its sides are parallel (as in a brick), non-parallel (e.g., a cone), or mixed (e.g., a wedge); and four units code its aspect ratio (a measure of elongation, coarsely coded from round to very elongated). The same eleven attributes are represented in the IGA, and in each array of geon attribute units in the SSM. The manner in which geon attributes are inferred from vertices and axes is borrowed largely from Hummel and Biederman (1992), and is not elaborated here. Each module also contains five units that code the module's categorical relations to other active modules (described below).
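As a concrete picture of the eleven-unit code, the following sketch builds one such attribute vector. The unit ordering, the aspect-ratio tuning centers, and the Gaussian coarse coding are all illustrative assumptions, not details from the model:

```python
import numpy as np

# Hypothetical index layout for the eleven attribute units:
# 0-1: cross-section (straight, curved)   2-3: axis (straight, curved)
# 4-6: sides (parallel, non-parallel, mixed)
# 7-10: aspect ratio, coarsely coded from round to very elongated
CROSS = {"straight": 0, "curved": 1}
AXIS = {"straight": 2, "curved": 3}
SIDES = {"parallel": 4, "non-parallel": 5, "mixed": 6}
ASPECT_CENTERS = np.array([1.0, 2.0, 4.0, 8.0])  # assumed tuning centers

def geon_vector(cross, axis, sides, aspect_ratio):
    """Build the 11-unit attribute vector for one geon."""
    g = np.zeros(11)
    g[CROSS[cross]] = g[AXIS[axis]] = g[SIDES[sides]] = 1.0
    # Coarse-code aspect ratio across four units (Gaussian tuning assumed).
    g[7:] = np.exp(-(np.log(aspect_ratio / ASPECT_CENTERS)) ** 2)
    return g

brick = geon_vector("straight", "straight", "parallel", 2.0)
```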

Mapping from Modules to Layer 4: For any pattern of output from Layer 2 (i.e., any part), each module will have an interpretation of some subset of that part on its geon attribute units; which subset a given module gets is determined by its receptive field. Given the goal of Layer 3 -- to send an interpretation of the whole part to the IGA, and an interpretation of its substructure to the SSM -- the next problem is to determine where (i.e., to which part of Layer 4) each module should send its interpretation of the features in its receptive field. A module with the whole part in its receptive field should send its output to the IGA, and a module with a subset of the part should send its output to the SSM. This mapping is performed by a set of gates on the connections from Layer 3 to Layer 4. There is one gate associated with each module. When a module has a whole part in its receptive field (as defined below), its gate will open. A module with an open gate (an open module) will direct its own output to the IGA, and direct the outputs of other modules to the various parts of the SSM. For example, consider the contours and module receptive fields in the bottom of Figure 3a. When module A is open, it will send its own output to the Independent Geon Array, and the outputs of B, C, and D to the SSM. Mapping from Layer 3 to Layer 4 thus becomes a problem of deciding (a) which modules


(gates) to open, and (b) how each gate should control the flow of output from modules to Layer 4. Four constraints govern this mapping.

Connectedness: The first problem is to determine whether a given module has a whole part in its receptive field. One solution is to treat the entire set of current feature outputs as a single part, and open any module with all those features in its receptive field (for the sake of clarity, this is how the term "part" has been used to this point). However, it is possible to specify a more sophisticated constraint on when a collection of features constitutes an object part. In natural images, parts of the same object tend to be connected to one another, so features occupying two or more disconnected regions in the image (e.g., Figure 3b) are unlikely to belong to the same object. By this connectedness constraint, a module will open its gate whenever it has a complete set of connected features in its receptive field. For example, there is one set of connected features (one part) in Figure 3a, and only module A has the whole part in its receptive field; modules B, C, and D each have various subsets of it. Therefore, A will open, but B - D will not. By contrast, the image in Figure 3b consists of two sets of connected features, so two modules (B and C) satisfy connectedness with respect to this image. When a module satisfies connectedness, it will open and send its output to the IGA. In Figure 3a, only A opens; in Figure 3b, B and C open, but A does not (by the smallest module constraint described below).

Figure 3. The constraints governing the mapping of module outputs to Layer 4. (a) The submodule constraint (also connectedness). (b) The connectedness constraint. (c) The smallest module constraint. (d) The one level constraint. See text for details.

Submodules: A submodule constraint governs the mapping of module outputs to the substructure matrix. Module j is a submodule of module i iff all points in j's receptive field are contained within i's receptive field. For example, B, C, and D in Figure 3a are all submodules of A. When module i has an entire part in its receptive field, its active submodules will have the subsets of that part in their receptive fields (where an active


iff a subset of points on j are inside r_i and a subset are outside r_i; and j is irrelevant to i if all points on j are outside r_i. For example, in Figure 4a, contours 1 - 3 are consistent with module B, 4 and 8 are inconsistent with B, and 5 - 7 are irrelevant to B. Consistent contours tend to open a module's gate (drive its value toward 1), and inconsistent contours tend to close it (drive its value toward 0). The second influence on G_i is the set of all G_j, where the j are submodules of i: the G_j drive G_i toward zero, enforcing the smallest module constraint. In Figure 4a, C is a submodule of both A and B, and B is a submodule of A, so G_C inhibits G_A and G_B, and G_B inhibits G_A. The value of a module's gate is:

$$ G_i^t = \begin{cases} 1 & \text{if } \left( \sum_j c_j^t - 2 - \kappa \sum_k c_k^t - \kappa \sum_l G_l^t \right) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{1} $$

where c_j^t is the output of any contour, j, that is consistent with i; c_k^t is the output of any contour, k, inconsistent with i; and G_l^t is the gate of any module l that is a submodule of i. κ is a constant (κ = 10). By (1), G_i will be 1 at time t iff: (a) more than two contours consistent with i are firing, (b) no inconsistent contours are firing (although any number of irrelevant contours may fire without consequence), and (c) no submodules of i have non-zero gates. Modules are updated from smallest to largest on the assumption that smaller modules receive their inputs before larger modules. Thus, although the modules are discussed as belonging to a single layer (Layer 3), Layer 3 actually consists of three layers of units, one per module size.
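Eq. 1 transcribes directly into code. The sketch below assumes the consistent and inconsistent contour outputs and the submodule gates have already been collected for module i:

```python
KAPPA = 10.0  # the constant kappa in Eq. 1

def module_gate(consistent, inconsistent, submodule_gates):
    """Eq. 1: the gate opens (returns 1.0) iff the summed output of
    consistent contours exceeds 2 plus kappa times the summed output
    of inconsistent contours plus kappa times the summed gates of the
    module's submodules."""
    drive = (sum(consistent)
             - 2.0
             - KAPPA * sum(inconsistent)
             - KAPPA * sum(submodule_gates))
    return 1.0 if drive > 0.0 else 0.0
```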

Figure 4. Implementation of the mapping constraints. (a) Implementation of the constraints governing a module's gate (connectedness and smallest module). Gates are depicted as labeled circles (e.g., GA is the gate on module A),


and the numbers in the rectangles refer to the numbered contours depicted in the bottom of the figure (contours are depicted as heavy, solid lines). (b) Implementation of the constraints on mapping as a function of gate states (submodule and one level). Arcs passing through a gate (circle) are controlled by that gate. For example, GA controls module A's connection to the IGA and B's and C's connections to the upper center of the SSM. The small arc between B's and C's connections to the SSM corresponds to the one-level constraint: When activation flows from module B to the SSM (i.e., when A is open and B is active), C's connection to the SSM is disabled.

The gates route the module outputs to Layer 4 as illustrated in Figure 4b. The net input to the IGA (both its geon and relation units) is simply the vector sum of all open modules' geon and relation arrays:

$$ \mathbf{n}_{\mathrm{IGA}}^t = \sum_i G_i^t \, \mathbf{g}_i^t , \tag{2} $$

where g_i^t is the activation vector on both the geon and relation units of module i. The activation of each unit in the IGA is computed as a Weber fraction of its input:

$$ a_i^t = \frac{n_i^t}{1 + n_i^t} , \tag{3} $$

where n_i^t is the ith element of the IGA's input vector.

Each submodule of an open module sends its output to the part of the substructure matrix corresponding to its position relative to that open module (Figure 3b). The input, n_k^t, to the kth geon array in the substructure matrix is given by the vector sum of geon arrays, g_j, over all j that are submodules of an open module (subject to the one-level constraint). Each term of the vector sum (i.e., each g_j) is weighted by the difference between k's position in the matrix (p_k) and j's position relative to the mean positions of all open modules (p_j^t):

$$ \mathbf{n}_k^t = \sum_i \sum_j G_i^t \, S(j,i) \, H(i,j,m) \, \mathbf{g}_j^t \left[ 1 - \frac{\left| \mathbf{p}_k - \mathbf{p}_j^t \right|}{w_k} \right]^+ , \tag{4} $$

where the function S returns 1 if j is a subset of i (where i is open) and zero otherwise, and [x]^+ denotes max(x, 0). w_k is the width of k's receptive field in the substructure matrix. The receptive fields of the arrays in the matrix are defined so that adjacent arrays have overlapping receptive fields, and w_k increases with k's distance from the center of the substructure matrix. Thus, the position of each sub-part relative to the whole part is coded coarsely over the 17 arrays in the matrix. H implements the one level constraint, returning zero if there is any active module m such that j is a subset of m and m is a subset of i (otherwise, H returns 1). Activation in the substructure matrix is given by (3).

Relations: Like many other aspects of the model, relations are highly simplified. Each module contains five units that code its position relative to the other active modules. These are above (active when a module is above another module), below, beside (active when a module is either left or right of another module), larger, and smaller. Whenever


a module sends its output to the substructure matrix, the matrix activates the module's relation units. For instance, when module A in Figure 5 is open, B's output will go to the upper center of the substructure matrix and C's will go to the left of the matrix. The matrix will activate B's above unit, and C's beside unit. Relative position is thus coded in terms of each module's categorical position relative to the part of which it was most recently a subset. Relative size is computed in a similar fashion. The substructure matrix activates a module's smaller unit (to send output to the matrix, a module must be a submodule of -- and, therefore, smaller than -- another module), and when a module opens, any activation in the substructure matrix activates its larger unit. A module's relations do not go to the IGA until the module is open (Eq. 2). The relative size units take binary activations, but relative position units take real-valued activations. For each module i:

$$ \mathrm{Above}_i = \left[ \frac{2 (y_j - y_i)}{rad_j} \right]^+ , \tag{5} $$

$$ \mathrm{Below}_i = \left[ \frac{2 (y_i - y_j)}{rad_j} \right]^+ , \tag{6} $$

$$ \mathrm{Beside}_i = \left[ \frac{2 \, | x_i - x_j |}{rad_j} \right]^+ , \tag{7} $$

where x_i and y_i are the coordinates of the center of i's receptive field, x_j and y_j are the mean coordinates of all open modules' receptive fields, and rad_j is the radius of the largest open receptive field.
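Eqs. 5-7 in code (a sketch; the convention that y increases downward in image coordinates is an assumption inferred from the direction of the subtraction in Eq. 5):

```python
def relation_units(xi, yi, x_mean, y_mean, rad):
    """Eqs. 5-7: real-valued relative-position activations for module i,
    computed against the mean position of all open modules; [x]+ is
    rectification at zero."""
    pos = lambda v: max(v, 0.0)
    above = pos(2.0 * (y_mean - yi) / rad)      # Eq. 5
    below = pos(2.0 * (yi - y_mean) / rad)      # Eq. 6
    beside = pos(2.0 * abs(xi - x_mean) / rad)  # Eq. 7
    return above, below, beside
```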

Illustration: Figure 5 illustrates the modules' operation. Consider first what happens when all eight contours in the image fire at the same time (Figure 5a). Contours 1...3 are consistent with module B, so they will tend to push G_B to 1, but contours 4...8, also firing, will keep G_B at zero (Eq. 1). G_C and G_D will similarly be kept at zero. All eight contours are consistent with A, so G_A will go to 1 (Eq. 1), and A will send its output to the IGA (Eq. 2). B, C, and D will send their outputs to the parts of the substructure matrix corresponding to their positions relative to A (Eq. 4). Next consider how the modules will behave once enough time has elapsed to get contours 1...3 out of synchrony with 4...8. When 1...3 fire (Figure 5b), B will receive input from consistent contours and no input from inconsistent contours, so G_B will be 1. By the smallest module constraint, G_A will be 0 (Eq. 1; and Figure 4a). Now open, B will send its output to the IGA, but because it has no submodules, nothing will go to the substructure matrix. When 4...8 fire (Figure 5c), A will again open and send its output to the IGA; and C and D will continue to go to the substructure matrix. Note that C and D will not open in response to contours 4...8 because they do not satisfy connectedness (both are inconsistent with contour 6). Modules B and A will alternate in this fashion as long as their respective contours continue to fire out of synchrony. The resulting patterns of activation in Layer 4 will be a structural description specifying the shapes of A and B, B's relation to A, and A's substructure (via C and D).


Figure 5. Illustration of the behavior of Layers 3 and 4 over time. See text for details.

Layers 5 and 6: Object Classification 

The model's fifth and sixth layers use the representation produced on Layer 4 as a basis for object classification. Each unit in Layer 5 responds selectively to a single geon feature assembly (GFA), an instantaneous pattern of activation on Layer 4 describing one part in terms of its shape, its relations to other parts, and its substructure. Each unit in Layer 6 sums the output of Layer 5 over time to activate a local representation of a complete object. The connections to Layers 5 and 6 are set by the simple training procedure described in the Simulations section.

Bottom-up input to Layers 5 and 6 is strictly excitatory. The excitatory input to each GFA unit in Layer 5 is the sum of dot products:

$$ e_i^t = \mathbf{g}^t \cdot \mathbf{w}_{gi} + \mathbf{r}^t \cdot \mathbf{w}_{ri} + \mathbf{s}^t \cdot \mathbf{w}_{si} , \tag{8} $$

where g^t and r^t are the activation vectors on the geon and relation units of the IGA, respectively, s^t is the activation vector on the substructure matrix, and w_gi, w_ri, and w_si are the corresponding weight vectors to unit i. During training, the weights to a GFA unit are determined by setting each vector equal to the corresponding activation vector in the to-be-learned pattern and then normalizing its length. In the simulations reported here, w_gi and w_ri are normalized to length 2, and w_si to length 1. Initially, each GFA unit's activation is set to its excitatory input. GFA units compete via shunting inhibition:

$$ a_i^t = \frac{(a_i^t)^3}{\sum_j (a_j^t)^3} , \qquad \sum_j (a_j^t)^3 > 0 . \tag{9} $$

This equation is iterated three times per t. Unit i's output is taken as the greater of its current activation and its previous output multiplied by a decay:

$$ o_i^t = \max\left( a_i^t , \; 0.95 \, o_i^{t-1} \right) . \tag{10} $$


Because GFA outputs decay gradually, object units effectively sum their inputs over time.

The excitatory input to object unit i is the sum of the outputs of the GFA units to which it is connected, weighted as a Weber function of the number of its GFAs:

$$ e_i^t = \frac{\sum_{j=1}^{n_i} o_j^t}{1 + n_i} , \tag{11} $$

where GFA unit j is connected to object unit i and n_i is the number of GFA units connected to i. Eq. 11, derived from a function proposed by Marshall (1994), allows objects with few GFAs to compete with objects that have many GFAs. The initial activation of an object unit is set to its excitatory input, and object units compete via shunting inhibition (Eq. 9). In the case of object units, Eq. 9 is applied only once per t.
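The classification machinery of Eqs. 8-11 in code form (a sketch; the vector shapes, the GFA-to-object connectivity structure, and the update scheduling are assumptions beyond the equations themselves):

```python
import numpy as np

def gfa_excitation(g, r, s, w_g, w_r, w_s):
    """Eq. 8: a GFA unit's excitatory input is a sum of dot products."""
    return g @ w_g + r @ w_r + s @ w_s

def shunt(acts, iterations=3):
    """Eq. 9: cubic shunting inhibition, iterated three times per t for
    GFA units and once per t for object units."""
    a = np.asarray(acts, dtype=float)
    for _ in range(iterations):
        cubed = a ** 3
        total = cubed.sum()
        if total > 0:          # Eq. 9 applies only when the sum is positive
            a = cubed / total
    return a

def gfa_output(act, prev_out):
    """Eq. 10: output is the greater of the current activation and the
    decayed previous output, so downstream object units integrate over
    time."""
    return np.maximum(act, 0.95 * prev_out)

def object_excitation(gfa_outs, connections):
    """Eq. 11: Weber-weighted sum of connected GFA outputs; connections
    maps each object index to the indices of its GFA units."""
    e = np.zeros(len(connections))
    for i, idxs in enumerate(connections):
        e[i] = gfa_outs[idxs].sum() / (1.0 + len(idxs))
    return e
```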

Simulations

The purpose of this modeling effort is to determine whether the architecture of Layers 3 and 4 can generate structural descriptions but still support recognition given errors of dynamic binding. Such errors are especially likely when an image is first presented as input, before the units in Layers 1 and 2 have had time to establish correct synchrony (recall Figure 2). The model's recognition performance immediately after the presentation of an image (i.e., at t = 1) therefore provides an index of its ability to tolerate binding errors and recognize objects rapidly. By itself, this test is not sufficient, however. The substructure matrix, which will play the dominant role in the model's early performance, is not, by itself, a structural description. It is equally important to test whether the model actually uses the structural descriptions it generates given time to segment an object into parts. To this end, it is important to test the model's ability to recognize objects under left-right reflection: The IGA encodes both left-of and right-of as beside, so the structural description of an object facing to the left will be the same as the structural description of that object facing to the right. By contrast, left-right reflection will change an object's representation in the substructure matrix, so recognition based on the substructure matrix alone will be left-right sensitive. If the model can exploit its structural descriptions for object recognition, then the following pattern should obtain: Trained on an object facing left, it should be able to recognize that object facing to the right, but only after it segments it into its parts. The simulations discussed below were explicitly designed to test for these effects.

Stimuli

The model was trained to recognize the twelve objects depicted in Figure 6. Although the images are idealized line drawings rather than gray-scale images, most of the objects are structurally complex. The complexity takes two forms. First, most objects are composed of several parts, increasing the likelihood that two or more parts will fire in synchrony accidentally. And second, some of them (e.g., slide, gun, and fish) are difficult to segment based on the local cues used by the Hummel (1994) grouping algorithm. This difficulty will produce variability in the way these objects are segmented


(e.g., sometimes the head of the fish is grouped as part of the body, and sometimes it is separated as a distinct part). This complexity makes it possible to observe the model's robustness to non-trivial errors of binding. The most important objects are the four widgets. These objects were explicitly designed to yield similar patterns on the IGA when their parts fire in synchrony with one another. The model's ability to recognize them before their parts are desynchronized thus constitutes an important test of its ability to tolerate accidental synchrony. The widgets were also designed to be left-right asymmetrical, so that the model's ability to recognize them under left-right reflection will be a meaningful test of its capacity for left-right invariance.

Figure 6. The object views the model was trained to recognize for the simulations reported here.

Training 

The model was trained on only one view of each object (the view depicted in Figure 6). For each view, training proceeded as follows. First, one unit in Layer 6 was designated to represent the object as a whole. Then the contours in the view were presented to Layer 1, and the model was allowed to run as described above. Each time at least one module opened, a pattern of activation (a GFA) was produced in Layer 4. A GFA unit was created for each part into which the object was segmented during the run. The unit's excitatory weights were initially set to the GFA itself and then normalized as described above. The new GFA unit was then connected to the current object unit. This procedure resulted in N+1 GFAs for any given object: one GFA for the object as a whole (before it could be segmented), and one for each of the N smaller parts into which the


object was segmented during the run. In general, N for each object was close to the number of its geons.
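The training procedure, as described, reduces to copying and normalizing each Layer 4 pattern (a sketch; the normalization lengths follow the text, and the hypothetical helpers are not the model's actual code):

```python
import numpy as np

def normalized(v, length):
    """Scale vector v to the given Euclidean length (zero vectors pass)."""
    n = np.linalg.norm(v)
    return (length / n) * v if n > 0 else v

def make_gfa_unit(g, r, s):
    """Create one GFA unit from a to-be-learned Layer 4 pattern: the
    weight vectors of Eq. 8 are set to the pattern itself, normalized
    to length 2 for the IGA's geon and relation vectors and length 1
    for the substructure matrix."""
    return {"w_g": normalized(g, 2.0),
            "w_r": normalized(r, 2.0),
            "w_s": normalized(s, 1.0)}

def train_object(layer4_patterns):
    """Connect one object unit to one new GFA unit per pattern produced
    while the training view runs (N+1 patterns in all: one for the whole
    object plus one for each of its N parts)."""
    return [make_gfa_unit(g, r, s) for (g, r, s) in layer4_patterns]
```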

General Simulation Procedure 

We ran two sets of simulations. They were run under exactly the same procedure, varying only the images presented. On each run, an image was presented to the model's first layer, and the model was allowed to run for 180 iterations. Two measures of performance were recorded on each run. Initial activation is the activation of an object unit on the very first iteration after the presentation of an image (i.e., at t = 1). Mean activation is the mean activation of an object unit over an entire 180-iteration run. Initial activation indicates the model's ability to recognize an object rapidly (before it is segmented into parts), and mean activation indicates its ability to recognize an object from its structural description. All the simulations described below were run ten times, and the mean of these response measures over the ten runs is reported. The data are depicted as matrices (see, e.g., Figure 7), with the initial activations in one matrix and the mean activations in the other. Each cell of a matrix shows the mean value (over the ten runs) of the response metric for one object unit (columns) in response to one image (rows). The diagonals of a matrix correspond to the "correct" object unit for each image. Response magnitudes are depicted as circles whose area is proportional to the mean value of the response metric over the ten runs (rounded to the nearest 0.1).

The numbers in the right-most column and bottom-most row of each matrix indicate, respectively, the degree to which the corresponding image selectively activated the correct object unit (image selectivity), and the degree to which each object unit selectively responded to the correct image (object selectivity). Selectivity was calculated as the response in the correct cell (i.e., the value on the diagonal) divided by the summed responses in all cells in the same row or column, respectively. Image selectivity indicates the overall accuracy of the model's response to the corresponding image: An image selectivity of, say, 0.5 indicates that half of all object unit activity in response to that image was accounted for by the correct object unit. Object selectivity indicates the degree to which the object unit responded selectively to the object on which it was trained: An object selectivity of 0.5 means that half of the unit's activity (over all images) was generated in response to the correct image. Each simulation is also reported in terms of the mean object and image selectivities (bottom right corner of each matrix), as indices of overall performance.
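Computed from a response matrix R (rows: images; columns: object units), the two selectivity indices just defined are:

```python
import numpy as np

def selectivities(R):
    """Image selectivity: each diagonal response divided by the summed
    responses in its row; object selectivity: each diagonal response
    divided by the summed responses in its column. Returns the per-item
    values and the two means reported in each matrix's corner."""
    diag = np.diag(R)
    image_sel = diag / R.sum(axis=1)    # per image (row)
    object_sel = diag / R.sum(axis=0)   # per object unit (column)
    return image_sel, object_sel, image_sel.mean(), object_sel.mean()
```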

Simulation 1: Trained Images

The most basic index of the model's performance is its ability to recognize each object on which it was trained, in the view that was used for training. This test was performed as described above, with the images presented for testing being those used for training. The model's performance (averaged over ten runs) is shown in Figure 7 (at end of document). Figure 7a shows the initial activation response metric, and Figure 7b shows the mean activation metric.


The model's initial recognition performance is quite good (Figure 7a). In response to the image of gun, the initial activation produced by the object unit for plug was higher than the initial activation of the object unit for gun (0.52 vs. 0.48, respectively); the model initially mistook the image of gun for an image of plug. However, all other image selectivities are 1.00, indicating that initial recognition of all other images was flawless. In fact, this level of performance is too high: Although biological visual systems recognize faces to a high degree of certainty based on the first spikes to reach IT, recognition is not perfect (see Tovee et al., 1993). The model's overestimate of initial performance may reflect a few factors, including (a) its small vocabulary (only 12 objects), or (b) the fact that each image was presented for testing exactly as it was trained.

Performance over the full 180 iterations (Figure 7b) is interesting in comparison to initial performance. As measured by mean object selectivity, performance over 180 iterations was slightly worse than initial performance (0.95 vs. 0.96, respectively). This noisier responding is also visible as the four off-diagonal circles in Figure 7b (compared with only one in Figure 7a). However, none of the off-diagonal mean activations is greater than any mean activation on the diagonal. Over the full 180 iterations, the model made no catastrophic errors of the initial plug-for-gun variety: Given time to generate a structural description, the model recognized every object correctly. The greater noise observed over the full 180 iterations is actually expected given the nature of the model's structural descriptions. Recall that dynamic binding in an independent attribute code preserves the attribute structures of the represented entities. This preservation of attribute structures manifested itself as a tendency for the geons of one object to activate GFA units for similar geons in other objects, resulting in noise once an object was segmented. By contrast, when all an object's geons fire together (as on the first iteration), each geon is represented on a different region of the substructure matrix, so parts with similar shapes can wind up represented on completely different units. This loss of attribute structures manifests itself as "cleaner" (i.e., more all-or-none) initial performance.

Overall, the results of this simulation show that the model can recognize objects before they are segmented into their parts. Moreover, the model's ability to correct the initial plug-for-gun error suggests that the structural descriptions it generates on later iterations actually aid object classification.

Simulation 2: Left-Right Reflection 

The second simulation was designed as a stronger test of the model's capacity to use the structural descriptions it generates. As discussed above, one property of the structural descriptions generated by this model is that they are invariant under left-right reflection. This invariance is not a property of the substructure matrix. If the model can use its structural descriptions for object classification, then recognition performance with a left-right reflection of a trained image should be poor initially, and improve once the image has been segmented into its parts. (Note that complete left-right invariance is not expected even after an object is decomposed into its parts, as the substructure matrix continues to play a role, representing the sub-structure of each part.) This simulation was


conducted as described above, with left-right reflections of the trained images used as stimuli. The results are shown in Figure 8 (at end of document).

As expected, initial performance under left-right reflection was poorer than initial performance with the original images. The mean image and object selectivities on the first iteration of this simulation were 0.70 and 0.69, respectively, as compared to 0.96 and 0.95 with the original images. However, initial performance with the reflected images was better than expected, revealing only three catastrophic errors: gun mistaken for plug; widget2 for plug; and widget4 for widget1. There was also some tendency to mistake slide for widget1. Although initial performance was better than expected, it nonetheless improved once the model segmented an image into a structural description. The mean activation response metric (Figure 8b) reveals zero catastrophic errors, but some tendency to confuse slide for chair. This improvement relative to initial performance shows that the model's structural descriptions contribute to its capacity for recognition under left-right reflection. Interestingly, the only near confusion on the first iteration that persisted throughout the remaining 179 was a slight tendency to confuse widget2 for gun. The remainder of the initial near confusions were replaced by a completely new set of near confusions over subsequent iterations. The precise reason for this pattern is unclear, but in general it suggests that the similarity of the reflected images to the model's stored view-like representations (in the substructure matrix) differs from their similarity to the model's stored structural descriptions.

4. Discussion

An important problem confronting structural description theories of object recognition is to explain how recognition can happen both rapidly, given that dynamic binding takes time, and reliably, given that dynamic binding may be error prone. The time required for dynamic binding in structural description is a special case of a general problem confronting all models of object recognition: In contrast to most models, biological visual systems do not wait around for time-consuming procedures to operate before they generate an estimate of object identity. The approach to object recognition proposed here is explicitly designed to address these limitations of previous models. This approach is based on the use of both dynamic and static binding. Dynamic binding is used to generate structural descriptions that are robust to some changes in viewpoint; but before dynamic binding can be established, or when the mechanisms that establish it fail, static binding in a view-like representation supports reliable, albeit viewpoint-sensitive, recognition. The model thus operates in cascade (McClelland, 1979), giving an initial estimate of object identity early on, and refining that estimate as its critical processes continue to run. In this respect, it differs from other models of object recognition, most of which must wait for their critical operations to run to completion before an estimate of object identity can be generated.
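
The cascade logic can be illustrated schematically. In the Python sketch below, step is a hypothetical stand-in for one iteration of the network's update (it is not the model's actual interface); the point is only that an identity estimate is available at every iteration, rather than solely after the last:

    import numpy as np

    def recognize_in_cascade(step, n_iterations=180):
        # `step(t)` stands in for one network iteration and returns the current
        # object-unit activations. A best guess is yielded at every iteration.
        total = None
        for t in range(n_iterations):
            act = step(t)
            total = act.copy() if total is None else total + act
            yield t, int(np.argmax(total))  # unit with greatest mean activation so far

Early estimates in such a scheme reflect the fast, statically bound (view-like) representation; later estimates fold in the structural description as dynamic binding is established.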

The idea that the visual system might exploit both view-like representations and structural descriptions for object recognition is not unique to this proposal (see, e.g., Bülthoff & Edelman, 1992; Farah, 1992; Tarr & Pinker, 1989, 1990). However, to our knowledge, no one has explicitly specified either how these two types of object representation might be integrated in a single system, or what the properties of the


resulting system would be. The architecture presented here constitutes an explicit proposal for how these two approaches to object representation might work together in a single system. Interestingly, our motivation was entirely different from the motivation of most others who propose hybrid representations. Most often, the hybrid is proposed as a general account of various properties of human object recognition. For instance, Bülthoff and Edelman (1992) note that view-based representations may provide a better account of our ability to identify particular objects (e.g., our own car in a parking lot full of objects that are structurally similar to it), while structural descriptions may provide a better account of class recognition (e.g., the fact that the other objects appear similar to our car in the first place). Tarr and Pinker (1990, 1992) proposed a hybrid representation to account for the strengths and limitations of our ability to recognize objects that are misoriented in the picture plane (see also Jolicoeur, 1990). By contrast, we proposed the hybrid as a solution to the specific problem of coping with binding errors in a system for structural description. Whether our model can inform any of these other issues remains to be seen, but preliminary simulation results are suggestive, as they show that the model overcomes some important limitations of both approaches. For instance, in contrast to most view-based models (but see Poggio & Vetter, 1992), this model can recognize an object in a novel viewpoint after exposure to only a single view; and given the primitives on which its representations are based (i.e., geons and categorical relations), it will likely reveal the class recognition properties of a structural description (although it is necessary to run the appropriate simulations). In contrast to most structural description models, it is not overly sensitive to the manner in which an object is segmented into parts.

The model is in an early stage of development, and a great deal more exploration and refinement of its properties is still necessary. However, it is possible to speculate with caution about a few additional issues to which this approach to object representation may speak. One is the computation of relative position. The model computes the position of one geon relative to another only when they are substructures of the same open module; because of the connectedness constraint, two geons will only be substructures of the same module if they are components of the same connected part. Hence, the model is biased to compute relations only between connected parts. This bias directly provides a way to prune the number of inter-part relations explicitly computed for recognition: Although there are N(N-1)/2 pair-wise relations between N parts in a scene, the number of relations between touching parts will tend to be much smaller. Hence, the model's connectedness bias may provide a way to reduce the cost of representing inter-part relations without sacrificing those relations that are likely to be important for object recognition, as the following sketch illustrates.
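
In the sketch below (in Python), parts and the touches predicate are hypothetical stand-ins for the model's parts and its connectedness constraint:

    from itertools import combinations

    def explicit_relations(parts, touches):
        # All N(N-1)/2 unordered pairs exist in principle, but only pairs of
        # directly connected (touching) parts are explicitly related.
        all_pairs = combinations(parts, 2)
        return [(a, b) for (a, b) in all_pairs if touches(a, b)]

For a ten-part scene this enumerates 45 candidate pairs; if each part touches only two or three others, roughly 10 to 15 relations need to be represented explicitly.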

This bias also suggests the beginnings of an account of figure-ground segregation: If every part is explicitly related only to the other parts in a scene to which it is directly connected, then the majority of relations made explicit in any scene will tend to belong to parts of the same object. Behaviorally, the model's connectedness bias predicts that connectedness should play an important role in the perception of relative position, even when other factors (such as distance, etc.) are held constant. Some recent data from our laboratory (Saiki & Hummel, in preparation) support this prediction. Finally, the proposed architecture supports object recognition in the absence of reliable dynamic binding, so inasmuch as one function of visual attention is to bind together object attributes (e.g., Treisman & Gelade, 1980), the architecture


suggests an account of object recognition in the absence of attention (e.g., Tipper, 1985). Moreover, the limitations of its performance in the absence of dynamic binding (e.g., the lack of invariance under left-right reflection) constitute testable predictions about the limitations of human object recognition in the absence of attention.

References

Barsalou, L. W. (1993). Flexibility, structure and linguistic vagary in concepts: Manifestations of a compositional system of perceptual symbols. In A. F. Collins, S. E. Gathercole, M. A. Conway, & P. E. Morris (Eds.), Theories of Memory. Hillsdale, NJ: Lawrence Erlbaum.

Bergevin, R., & Levine, M. D. (1993). Generic object recognition: Building and matching coarse descriptions from line drawings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15, 19-36.

Biederman, I. (1987). Recognition-by-components: A theory of human image

understanding. Psychological Review, 94 (2), 115-147.

Biederman, I., & Cooper, E. E. (1991). Priming contour-deleted images: Evidence for intermediate representations in visual object recognition. Cognitive Psychology, 23, 393-419.

Biederman, I., & Gerhardstein, P. C. (1994). Recognizing depth-rotated objects: Evidence and conditions for 3-dimensional viewpoint invariance. Journal of Experimental Psychology: Human Perception and Performance, 19, 1162-1182.

Bülthoff, H. H., & Edelman, S. (1992). Psychophysical support for a two-dimensional view interpolation theory of object recognition. Proceedings of the National Academy of Sciences, USA, 89, 60-64.

Clowes, M. B. (1967). Perception, picture processing and computers. In N. L. Collins & D. Michie (Eds.), Machine Intelligence (Vol. 1, pp. 181-197). Edinburgh, Scotland: Oliver & Boyd.

Cooper, E. E., & Biederman, I. (in preparation). Object recognition under brief 

exposures. University of Southern California.

Cooper, E. E., Biederman, I., & Hummel, J. E. (1992). Metric invariance in object recognition: A review and further evidence. Canadian Journal of Psychology, 46, 191-214.

Dickinson, S. J., Pentland, A. P., & Rosenfeld, A. (1992). 3-D shape recovery using

distributed aspect matching.   IEEE Transactions on Pattern Analysis and Machine

 Intelligence, 14, 174-198.

Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex?


Multiple electrode and correlation analysis in the cat. Biological Cybernetics, 60, 121-130.

Eckhorn, R., Reitboeck, H., Arndt, M., & Dicke, P. (1990). Feature linking via synchronization among distributed assemblies: Simulations of results from cat visual cortex. Neural Computation, 2, 293-307.

Farah, M. (1992). Is an object an object an object? Cognitive and neuropsychological investigations of domain-specificity in visual object recognition. Current Directions in Psychological Science, 1, 164-169.

Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. In S. Pinker & J. Mehler (Eds.), Connections and Symbols. Cambridge, MA: MIT Press.

Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334-337.

Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proceedings of the National Academy of Sciences, USA, 86, 1698-1702.

Hoffman, D. D., & Richards, W. A. (1985). Parts of recognition. Cognition, 18, 65-96.

Hummel, J. E. (1994). Object recognition. In M. Arbib (Ed.), Handbook of Brain Theory and Neural Networks, in press.

Hummel, J. E., & Biederman, I. (1990). Dynamic binding: A basis for the representation of shape by neural networks. Proceedings of the 12th Annual Conference of the Cognitive Science Society, pp. 614-621.

Hummel, J. E., & Biederman, I. (1992). Dynamic binding in a neural network for shape

recognition. Psychological Review, 99, 480-517. 

Hummel, J. E., Burns, B., & Holyoak, K. J. (1994). Analogical mapping by dynamic binding: Preliminary investigations. In K. J. Holyoak & J. A. Barnden (Eds.), Advances in connectionist and neural computation theory, Vol. 2: Analogical connections (pp. 416-445). Norwood, NJ: Ablex.

Hummel, J. E., & Holyoak, K. J. (1993). Distributing structure over time. Review of L. Shastri and V. Ajjanagadde, From simple associations to systematic reasoning: A connectionist representation of rules, variables and dynamic bindings. Behavioral and Brain Sciences, 16, 464.

Hummel, J. E. & Stankiewicz, B. J. (1994). Segmenting images at matched concavities

with synchrony for binding. Technical report 94-1, Shape Perception Laboratory,

University of California, Los Angeles.


Intraub, H. (1981). Identification and processing of briefly glimpsed visual scenes. In D. Fisher, R. A. Monty, & J. W. Senders (Eds.), Eye Movements: Cognition and Visual Perception (pp. 181-190). Hillsdale, NJ: Erlbaum.

Jolicoeur, P. (1990). Identification of disoriented objects: A dual systems theory.  Mind 

& Language, 5, 387-410.

Lowe, D. G. (1987). The viewpoint consistency constraint. International Journal of Computer Vision, 1, 57-72.

Marr, D., & Nishihara, H. K. (1978). Representation and recognition of three-dimensional shapes. Proceedings of the Royal Society of London, Series B, 200, 269-294.

Marshall, J. A. (1990). A self-organizing scale-sensitive neural network. Proceedings of the International Joint Conference on Neural Networks, San Diego, CA, June, Vol. 3, pp. 649-654.

McClelland, J. L. (1979). On the time relations of mental processes: An examination of systems of processes in cascade. Psychological Review, 86 , 287-330.

Oram, M. W., & Perrett, D. I. (1992). Time course of neural responses discriminating different views of the face and head. Journal of Neurophysiology, 68, 70-84.

Palmer, S. E. (1975). Visual perception and world knowledge: Notes on a model of sensory-cognitive interaction. In D. A. Norman & D. E. Rumelhart (Eds.), Explorations in Cognition (pp. 279-307). San Francisco: Freeman.

Palmer, S. E. (1977). Hierarchical structure in perceptual representation. Cognitive

Psychology, 9, 441-474.

Poggio, T., & Edelman, S. (1990). A network that learns to recognize three-dimensional objects. Nature, 343, 263-266.

Poggio, T., & Vetter, T. (1992). Recognition and structure from one 2D model view: Observations on prototypes, object classes, and symmetries. MIT AI Memo 1347, Massachusetts Institute of Technology.

Quinlan, P. T. (1991). Differing approaches to two-dimensional shape recognition.

Psychological Bulletin, 109 (2), 224-241.

Saiki, J. & Hummel, J. E. (in preparation). The role of connectedness in binding parts to

relations for object categorization.

Shastri, L., & Ajjanagadde, V. (1993). From simple associations to systematic reasoning:

A connectionist representation of rules, variables and dynamic bindings.  Behavioral 

and Brain Sciences, 16 , 417-494.


Seibert, M., & Waxman, A. M. (1992). Learning and recognizing 3D objects from multiple views in a neural system. In H. Wechsler (Ed.), Neural Networks for Perception, Volume 1, Human and Machine Perception (pp. 427-444). Academic Press.

Smolensky, P. (1990). Tensor product variable binding and the representation of 

symbolic structures in connectionist systems.  Artificial Intelligence, 46 , 159-216.

Sutherland, N. S. (1968). Outlines of a theory of visual pattern recognition in animals and man. Proceedings of the Royal Society of London (Series B), 171, 95-103.

Tarr, M. J., & Pinker, S. (1989). Mental rotation and orientation dependence in shape recognition. Cognitive Psychology, 21, 233-283.

Tarr, M. J., & Pinker, S. (1990). When does human object recognition use a viewer-centered reference frame? Psychological Science, 1(4), 253-256.

Tipper, S. P. (1985). The negative priming effect: Inhibitory effects of ignored primes.

Quarterly Journal of Experimental Psychology, 37A, 571-590.

Treisman, A., & Gelade, G. (1980). A feature integration theory of attention. Cognitive

Psychology, 12, 97-136.

Tovee, M. J., Rolls, E. T., Treves, A., & Bellis, R. P. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. Journal of Neurophysiology, 70, 640-654.

Ullman, S. (1989). Aligning pictorial descriptions: An approach to object recognition. Cognition, 32, 193-254.

Ullman, S., & Basri, R. (1991). Recognition by linear combinations of models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 992-1006.

von der Malsburg, C. (1981). The correlation theory of brain function. Internal Report 81-2, Department of Neurobiology, Max-Planck-Institute for Biophysical Chemistry, Göttingen, Germany.

Winston, P. (1975). Learning structural descriptions from examples. In P. Winston, The

Psychology of Computer Vision (pp. 157-209). New York: McGraw-Hill.

Author Notes

This research was supported by a research grant from the UCLA Academic Senate to JEH. We are grateful to Lisa Travis, Irv Biederman, Jun Saiki, and Nancy Kanwisher for helpful discussions about this work, and to Lisa Travis for her extremely helpful comments on an earlier draft of this manuscript.