- Home
- Documents
*Chapter 1 Chapter 1 Methods based on spatial processes Lisa Handl, Christian Hirsch and Volker...*

prev

next

out of 35

View

0Download

0

Embed Size (px)

Chapter 1

Methods based on spatial processes

Lisa Handl, Christian Hirsch and Volker Schmidt

Institute of Stochastics, Ulm University, 89069 Ulm, Germany

Abstract

We present methods of cluster analysis to identify and classify patterns which are formed by stationary point

processes and used e.g. in the modeling of human cartilage cells. In particular, we consider a generalization of the

single-linkage approach which has been appropriately adapted to deal with inhomogeneous data and describe how

the random-forest methodology can be applied to the classification of the previously identified clusters.

Furthermore, we show how statistical techniques for spatial point processes can be used to validate the methods of

cluster identification and cluster classification considered in this chapter.

1.1 Introduction

In this chapter we present methods of cluster analysis to identify and classify patterns which

are formed by stationary point processes and used e.g. in the modeling of human cartilage

cells. In particular, we consider a generalization of the single-linkage approach adapted to

deal with inhomogeneous data and describe how the random-forest methodology can be

1

2 CHAPTER 1. METHODS BASED ON SPATIAL PROCESSES

applied to the classification of the previously identified clusters. Note that the problem of

clustering point patterns occurs not only at microscopic scales in fields like biology, physics,

and computational materials science, but also in geographic applications at macroscopic

scales. Often point-process models play an important role in these applications and the

development of special cluster algorithms adapted to the problem at hand is imperative.

Let us first discuss a macroscopical example which builds on a point-process model de-

veloped in [3]. As explained in [4] due to recent technological advances it is now possible to

cost-efficiently produce small sensors which can be used to gather information on their envi-

ronment. Due to their small size such sensors can carry only very limited battery capacities

and devising energy-efficient computation schemes is crucial. As the cost of transmitting

information over large distances is usually higher than the cost of computation itself, it

makes sense to group these sensors into hierarchical clusters so that only one node in each

cluster has to transmit the information to the next hierarchy level. In particular it is de-

sirable to have a cluster algorithm that is capable of creating a large number of moderately

sized clusters. The classical single-linkage algorithm is inappropriate for this task. Indeed,

by percolation theory we can either observe the occurrence of an infinite component or we

obtain a large number of clusters consisting of only very few sensors, see e.g. [25, 31].

Next let us discuss an application at a microscopic scale. Recently it was observed [26,

33, 34] that the arrangement of chondrocyte cells in the superficial zone of human cartilage

form distinct patterns according to the health status of the cartilage, see Figure 1.1 for an

illustration. By being able to analyze these patterns and to understand their connection

to the health status of the examined cartilage, it might become possible to see already on

the scale of cells, where degenerative diseases such as osteoarthritis might develop soon. In

order to create a diagnostic tool based on this observation, one important step is to develop

point-process models for such cartilage patterns, see [26] for early results in this direction.

Using this type of stochastic model makes diagnoses based on cell patterns amenable to

rigorous statistical testing.

Motivated by these applications, the goal of this chapter is to describe how statistical

techniques for spatial point processes can be used in the validation of cluster identification

1.2. GENERALIZED SINGLE-LINKAGE ALGORITHM 3

Figure 1.1: Chondrocyte patterns in the superficial zone of the condyle of the human knee joint. Figure taken from [33].

and classification steps, where the chapter is structured as follows. In Section 1.2, we present

a generalization of the classical single-linkage algorithm which may be useful in situations

where substantial inhomogeneities of cluster densities can be observed. To tackle the problem

of cluster classification we recall in Section 1.3 the probabilistic method based on random

forests which was introduced in a seminal paper of Breiman [7]. In Section 1.4, we show how

the introduced methods can be applied to point patterns formed by cell nuclei in human

cartilage tissue. Finally in Section 1.5 we show how statistical techniques for spatial point

processes can be used in the validation of the identification and classification steps, where

we will first describe the model used as test scenario and then explain our results.

1.2 Generalized single-linkage algorithm

We first review some techniques to reconstruct clusters and correctly determine their shape

from a single realization of the model. Naturally this task splits up into two parts. In the

current section we discuss the issue of cluster identification while Section 1.3 is devoted to

the discussion of the classification step. For generalities on the single-linkage algorithm and

other classical hierarchical clustering algorithms we refer the reader to Chapter ??. reference to Chapter 2.1

4 CHAPTER 1. METHODS BASED ON SPATIAL PROCESSES

1.2.1 Description of the algorithm

When considering the identification of clusters generated by a point-process model, the

classical method of the single-linkage algorithm seems a reasonable choice. Indeed this

algorithm is capable of identifying e.g. long string-like clusters generated inside elongated

ellipsoids, see Section 1.5. However the presence of inhomogeneities of cluster densities

make it difficult to apply this algorithm directly. Recall from Chapter ?? that the classicalreference to Chapter 2.1

single-linkage algorithm has only one parameter corresponding to a global threshold value.

Therefore it is difficult to choose a single global threshold value appropriate for both dense

as well as sparse clusters. We return to this point in Section 1.5.4 where we show numerical

evidence for this observation. We also refer the reader to the survey paper [28] for related

generalizations of classical single-linkage (see also [17] and [32]).

In order to describe our generalization of the classical single-linkage algorithm, we recall

some basic notions from graph theory. A graph G is called locally finite if every vertex has

finite degree. Furthermore a rooted graph is a graph G = (ϕ,E) together with a distinguished

vertex v0 ∈ ϕ. Also recall that a minimum spanning tree on a finite set ϕ ⊂ R3 is a tree of minimal total length among all trees whose vertices are given by ϕ. Recall from

Chapter ?? that the single-linkage algorithm defines the clustering corresponding to thereference to Chapter 2.1

connected components of the graph obtained from deleting all edges of length larger than

the global threshold value from the minimum spanning tree. To deal with the problem of

varying cluster densities described above we propose to consider a generalization of the single-

linkage algorithm capable of taking into account not only the length of a specified edge but

also the local geometry of the minimum spanning tree close to this edge. This approach can

be seen as a mathematical formalization of the graph-theoretical methods introduced in [36]

for the detection of clusters of different shapes which also make use of local characteristics

of minimum spanning trees.

Definition 1. Write G∗ for the set of all locally finite rooted graphs in R3. A splitting rule is a function f : G∗×G∗× [0,∞)→ {0, 1} satisfying f(g1, g2, x) = f(g2, g1, x) for all g1, g2 ∈ G∗

and x ∈ [0,∞).

A splitting rule can be used to define a clustering in the following way.

1.2. GENERALIZED SINGLE-LINKAGE ALGORITHM 5

Definition 2. Let ϕ ⊂ R3 be locally finite and let f : G∗ × G∗ × [0,∞) → {0, 1} be a splitting rule. Write T = (ϕ,ET ) for the (Euclidean) minimum spanning forest on ϕ, i.e.,

{x, y} ∈ ET if and only if there do not exist an integer n ≥ 1 and a sequence of vertices x = x0, x1, . . . , xn = y ∈ ϕ such that |xi − xi+1| < |x− y| for all 0 ≤ i ≤ n − 1. For {x1, x2} ∈ ET write Tx1 , Tx2 for the connected components of (ϕ,ET \ {{x1, x2}}) containing x1 and x2, respectively (and rooted at x1 and x2, respectively). Then denote by ϕf the family

of connected components of the graph (ϕ, {{x1, x2} ∈ ET : f(Tx1 , Tx2 , |x1 − x2|) = 1}) and call ϕf the clustering of ϕ induced by f .

1.2.2 Examples of splitting rules

To illustrate the usage of the generalized single-linkage algorithm introduced in Section 1.2.1,

let us consider several examples of splitting rules.

Example 3. For f(g1, g2, `) = 1{x:x 0, where 1A is the indicator of the set A. See Figure 1.2 for an

illustration of this clustering rule.

Figure 1.2: Clustering induced by the splitting rule of Example 3

Example 4. For a finite set A denote by |A| the n