GeoSOM suite: a tool for spatial clustering · 4 Roberto Henriques1, Fernando Bação1, Victor...

GeoSOM suite: a tool for spatial clustering

Roberto Henriques1, Fernando Bação1, Victor Lobo1,2

1 Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa (ISEGIUNL),

Campus de Campolide, 1070-312 Lisboa, Portugal, 2 Portuguese Naval Academy,

Alfeite, 2810-001 Almada, Portugal, {roberto, bacao, vlobo}@isegi.unl.pt

Abstract. The large amount of spatial data available today demands the use of data mining tools for its analysis. One of the most used data mining techniques is clustering. Several methods for spatial clustering exist, but many consider space as just another variable. We present in this paper a tool particularly suited for spatial clustering: the GeoSOM suite. This tool implements the GeoSOM algorithm, which is based on Self-Organizing Maps. This paper describes this tool, and shows that it is adequate for exploring spatial data. Keywords: Spatial clustering, self-organizing maps, exploratory spatial data analysis.

1 Introduction

Advances in database technologies and data collecting devices originated a massive growth in stored spatial data. This data can only be of use if the means to translate it into information and knowledge are available. Data mining tools, or more specifically spatial data mining tools, constitute one of the most promising set of tools that are available to face this flood of data. Spatial data mining can be defined as the discovery of interesting relationships, spatial patterns and characteristics that may exist implicitly in spatial databases.

Clustering constitutes one of the most powerful methodologies to tackle the problems posed by the size and dimensionality of the “new” spatial databases. Clustering is the partition of data into groups of similar objects [1], thus it tries to achieve a trade-off between complexity and interpretability. The possibility of studying homogeneous subsets of the original database is particularly appealing and effective, when one is trying to understand a complex phenomenon. The objects are usually represented as a vector of measurements or a point in a multidimensional space [2]. Kaufman and Rousseeuw [3] proposed a general classification of clustering algorithms into two groups: partitioning and hierarchical algorithms. While partitioning algorithms split a database of n objects into a set of k clusters, hierarchical algorithms create a hierarchical decomposition of the same database. More recently, Berkhin [4] classified the clustering algorithms in eight overlapping classes: (1) hierarchical, (2) partitioning (including k-means, k-medoids, probability and density based methods), (3) grid-based, (4) categorical data based, (5) constraint

2 Roberto Henriques1, Fernando Bação1, Victor Lobo1,2

based methods, (6) machine learning based, (7) scalable methods and (8) high-dimensional data methods. Many clustering algorithms have been proposed in the past (see [4] for a detailed survey). More recently, with spatial databases expansion, new algorithms or variants of existing ones have been proposed.

From the hierarchical class we would like to draw attention to CURE [5], which produces clusters of different shapes and sizes, and is quite insensitive to outliers. These characteristics allow CURE to perform well with spatial data [4]. CURE is able to handle large databases, by applying a clustering algorithm to random subsets of the database, which is naturally much faster. The algorithm then goes on to form the final clusters with the centroids obtained in this process.

CLARANS [6], HEM [7] and DBSCAN/GDBSCAN [8, 9] are examples of partitioning methods that are useful for spatial data mining.

CLARANS is a k-medoids method based on randomized search resulting from an improvement of CLARA and PAM [3]. In these methods, finding k-medoids is viewed as a searching task through nodes in a graph. Each node has a set of k objects corresponding to dataset objects. For each node a cost is defined as the total dissimilarity between every object and the medoid of its cluster. These methods search for a minimum on that graph. The main differences amongst these methods are that PAM compares, at each step, all the neighbours of the current node, while CLARA only compares some neighbours and restricts the search to subgraphs. CLARANS is a mixture of both since it also only checks some neighbours but does not restrict its search to a particular subgraph.

Hybrid Expectation Maximization (HEM) is an extension of expectation maximization (EM) suited for the specific case of spatial clustering. The EM algorithm provides a way to maximize the likelihood function even if we don’t know its parameters, by using a probabilistic method that alternates between two major steps: an expectation and a maximization step. The first step calculates the expectation of the likelihood by including latent variables and the second step estimates the parameters by maximizing the expected likelihood found on the first step [10, 11].

DBSCAN, and its extension GDBSCAN, are density-based methods that perform clustering using point and polygon data, respectively. These density-based algorithms start with an arbitrary point and find all the neighbour points within a distance threshold. If the number of neighbours found reaches a neighbour threshold then a cluster arises. The algorithm recursively applies this search for all the neighbours. When the search is complete, the algorithm selects another unvisited point, until all points belong to a cluster.

From the grid based methods, WaveCluster [12] applies wavelet transforms to filter data. The authors assume that multidimensional spatial data objects can be represented in an n-dimensional feature space and that the dataset is an n-dimensional signal. Wavelet transforms are applied to this signal, decomposing it into high and low frequency subbands. The high frequency signal occurs in the boundaries of clusters and the low frequency of the signal matches the cluster structure itself.

From machine learning arena we consider the Self-Organizing Maps (SOM) [13]. This is an example of a method that can belong to several classes. SOM is a machine learning method (because it is a neural network) but can also be seen as a k-means or a density based method, depending if one uses “k-means” SOMs or “emergent”

GeoSOM suite: a tool for spatial clustering 3

SOMs [14]. In “k-means” SOMs the number of units is the same as the number of clusters needed, while in emergent SOMs the number of units is very large, and clusters are detected from the units’ density.

Although some of the presented clustering methods (HEM, GDBSCAN) are specific or suffered adaptations to deal with spatial clustering, others, because of their innate properties, can present good results by including extra x and y variables. This is the case of CURE, DBSCAN and SOM.

Although spatial clustering is possible with these methods, they generally assume geography just as another variable in the multidimensional space. Indeed none of these methods explicitly accounts for spatial dependency (explained by the first law of geography [15]) or spatial heterogeneity [16] . In this paper we propose the GeoSOM to tackle this limitation. The GeoSOM is an extension of SOM proposed in [17] for the specific difficulties posed by spatial data, effectively implementing the idea that “everything is related to everything else, but near things are more related than distant things”. In fact, the GeoSOM uses geography as a restriction of non-spatial attributes clustering, thus all clustering is constrained by the geographic distance between cluster objects (i.e. objects in the same cluster are similar in attributes but have to be nearby geographically)

This paper extends [17] by presenting a tool called GeoSOM suite. This tool implements the GeoSOM and the SOM algorithms, providing a friendly and ready to use data exploration tool. GeoSOM suite enables the user to interact and combine multiple solutions improving the understanding and knowledge about data and the clusters produced.

The paper is organized as follows: section 2 reviews SOM and GeoSOM algorithm. Section 3 presents GeoSOM suite. Section 4 presents a case study concludes the paper and discusses future work.

2 SOM and GeoSOM

Teuvo Kohonen proposed Self-organizing maps (SOM) in the beginning of the 1980s [18]. The SOM is usually used for mapping high-dimensional data into one, two, or three-dimensional feature maps, which are grids of units or neurons. The grid forms what is known as the output space, as opposed to the input space that is the original space of the data patterns. The mapping tries to preserve topological relations, i.e. patterns that are close in the input space will be mapped to units that are close in the output space, and vice versa. The output space will usually be two-dimensional, and most SOM implementations use a rectangular grid of units. To provide even distances between neighbour units in the output space, hexagonal grids are sometimes used [13]. Each unit, being an input layer unit, has as many weights as the input patterns, and can thus be regarded as a vector in the same space of the patterns.

The basic idea during training is that each datum is compared to all units. The most similar one, known as BMU (Best Matching Unit) is said to map that datum. The BMU and its neighbours in the grid are then updated so as to be closer to that particular datum. After a sufficiently large number of iterations, the units will be a


representation of the dataset in the sense, that they will follow the same density distribution.

When clustering using SOMs we can follow two major methodologies. In the first method, each unit is a cluster centroid, so the number of required clusters should define the size of the SOM. Because this is similar to k-means clustering, this is usually referred as “k-means” SOM.

The second method uses a much larger SOM with much more units than the required number of clusters. The SOM units are in fact mapping the input space into a 2-dimensional space. We may visualize the density distribution of the units in the input space (and by proxy of the original data) mapped onto the 2-dimensional output space. The best way to analyse the density of the input space in the output space is through the use of U-matrices [19]. This approach is usually known as “emergent” SOMs [14]. The GeoSOM constitutes a variation of the original SOM algorithm, and it was devised to explicitly consider the spatial nature of data. In GeoSOM the search for the best matching unit (BMU) has two phases. The first phase settles the geographical neighbourhood where it is admissible to search for the BMU, and the second phase performs the final search using the other multidimensional components. The search neighbourhood is controlled by a parameter k, defined in the output space. Using k=0 the algorithm will necessarily select as BMU the closest geographical unit. On the other hand, setting k equal to the size of the map (the SOM), space is ignored, and the algorithm will perform just like a regular SOM. The results obtained with k=0 , will be similar to the training of a standard SOM with only the geographical locations (x,y coordinates), and then using each unit as a low pass filter for the non-geographic features. As k (the geographic tolerance) increases, the unit locations will no longer be quasi-proportional to the locations of the training patterns, and the “equivalent filter” functions of the units will become more and more skewed, eventually ceasing to be useful as models. For a detailed explanation of the workings of the GeoSOM the reader is referred to [20].

3 GeoSOM suite tool

The GeoSOM suite implementation is based on Matlab and SOM toolbox [21]. A stand-alone graphic user interface (GUI) was built, allowing non-programmer users to evaluate and use the SOM and the GeoSOM algorithms. The GeoSOM suite is freely available at www.isegi.unl.pt/labnt/geosom. Fig. 1 shows the general GeoSOM suite architecture, which consists of (1) a GUI, (2) Matlab runtime components including SOM toolbox and GeoSOM routines, (3) access to spatial and non-spatial data and (4) production of outputs. These outputs consist of geographic maps, U-matrices, component planes, hit maps, and parallel coordinate plots. The GeoSOM suite also allows the use of multiple analysis tools (e.g. the possibility of using several SOM and GeoSOM) on the same dataset.


Fig. 1. GeoSOM suite architecture

Fig. 2 presents a screen-shot of the GeoSOM suite. The main interface contains a table of attributes and a tree view pointing to all the views created. Also presented in the figure are three examples of views: a geographic data, a u-mat and a box plot.

Fig. 2. GeoSOM suite interface: from the left to the right, GeoSOM suite main window with non-spatial attributes, u-mat using census data and geographic map from Lisbon Metropolitan

Area and a box plot showing the distribution of two variables.

GeoSOM suite’s main functionalities are: (1) present and select by point and click geographical data, (2) train SOMs and GeoSOMs according to the nature of the data


used, (3) generate several graphs and tables known as views (U-matrices, Component planes, etc) and (4) link the views dynamically (i.e. when one view is clicked data is selected on all views).

3.1 Views

Views are different representations of the SOM (or GeoSOM) that play a central role in the exploration of the results and the interpretation of the outputs. Presently, GeoSOM suite includes U-matrices, component planes, hit maps and parallel coordinate plots.

The U-matrices [19] are computed by finding the distances in the input space of neighbouring units in the output space. The most common is to use a colour scheme or a grey scale to represent these distances. In the latter the highest value is usually represented by black and the lowest value represented by white. Low value areas correspond to a high density in the data and thus a cluster, while high values correspond to separations between clusters. It is beyond the scope of this paper to discuss the interpretation of U-matrices but it is inherently a subjective issue [19].

Component planes [13] are another SOM representation where each unit gets a colour based on the weight of each variable used in the analysis. Thus, a component plane exists for each variable showing the units’ weights on that variable.

Another possible view in the GeoSOM suite is the hit map [13]. This representation is usually superimposed on the U-matrix or on the component planes, and gives information about the number of data items represented by each unit (data items with the same BMU).

A parallel coordinates [22] plot is a data analysis technique for plotting multivariate data. This technique starts by drawing parallel axes for all variables. A line connecting values on each variable axis maps each data item.

3.2 Dynamically linked windows

This feature allows the selection of elements on a view providing the user with the matching selection on any other open view. Thus, if the user selects some features in a geographic map, the corresponding selection will be presented on the u-mat, the component planes and the parallel coordinate plots. The use of dynamically linked windows enables strong interaction with the data, allowing users to analyze data from different perspectives.

3.3 Clustering in GeoSOM suite

GeoSOM suite allows the users to define clusters on plots of U-matrices. Thus the user can draw the clusters according to his interpretation of the U-matrix, and label input data points accordingly. To help users define clusters two extra tools are available in the GeoSOM suite: hierarchical clustering of units and what we call a z-level tool. In the first case, we cluster the units based on distances using a hierarchical algorithm (single-linkage), and label each unit on the U-matrix. The z-level tool is a simple query on the U-matrix that highlights all the objects that are below a certain


threshold. If we consider the U-matrix in three dimensions (assuming as third dimension the distance between units) high-density areas correspond to the valleys while low-density areas correspond to mountains. Thus, clusters match the valley zones while the mountains are the cluster frontiers. After defining the clusters on top U-matrices, it is possible to visualize the clusters on all open views. Fig. 3 shows the GeoSOM suite clustering tools applied to building’s age data in Lisbon’s Metropolitan Area.

Fig. 3. GeoSOM suite clustering: defining the clusters on the U-matrix allows the user to create selections on the geographic map and component planes

3.4 Combining multiple clusters in GeoSOM suite

Another analysis possible with GeoSOM suite is to train several SOMs or GeoSOMs using the same dataset. This possibility has different applications. First, it allows the comparison between the “k-means” and “emergent” cluster methods using the same dataset. Therefore, the user can compare the clusters produced using a predefined number of clusters or searching for “natural” clusters.

This feature also allows a sensibility analysis of SOM and GeoSOM by comparing results using different input parameters. It also allows the comparison of SOM and GeoSOM algorithms in clustering. Using this possibility will give the user an insight on the geographic nature of data. Fig. 4 shows this comparison using the previously created GeoSOM (Fig. 3) and a SOM trained using the same buildings dataset.

Finally, the user might choose to do several SOM’s using different subsets of variables. This can be thought of as building different thematic classifications (e.g. for the census dataset it can be building characteristics, family characteristics, unemployment characteristics, or other) which can, in the end, be evaluated together.


Fig. 4. Comparison of a GeoSOM (bottom right) and SOM (bottom left) trained using buildings age data for the Lisbon Metropolitan Area. The top left section contains the tabular data, the top right the geographic map, and in the centre, we present some component planes with hit maps.

Selecting one cluster on the GeoSOM U-matrix (bottom right) it is possible to check in the regular SOM (bottom left), and in the other views, how are buildings distributed (they are

represented in red in all views).

4 Case study: Lisbon’s Metropolitan Buildings Age

To exemplify how to use the GeoSOM suite we analyzed data concerning the age of buildings in the Lisbon Metropolitan Area (LMA) using the census 2001 dataset. The data are aggregated in Enumeration Districts (secção estatística in Portuguese). The variables used in this study are shown in Table 1.

Table 1. Variables used in this case study.

Variable Description E1919 Percentage of houses built before 1919 E1945 Percentage of houses built between 1919 and 1945 E1960 Percentage of houses built between 1945 and 1960 E1970 Percentage of houses built between 1960 and 1970 E1980 Percentage of houses built between 1970 and 1980 E1985 Percentage of houses built between 1980 and 1985 E1990 Percentage of houses built between 1985 and 1990 E1995 Percentage of houses built between 1990 and 1995 E2001 Percentage of houses built between 1995 and 2001


We start this analysis by applying a 20x10 SOM to the dataset. The resulting U-matrix is shown in Fig. 5. Analysing the U-matrix we detect a main frontier separating two distinct main regions (represented by the red and green polygons)

Fig. 5. U-matrix produced from the SOM

A cross analysis of the U-Matrix and the component planes (Fig. 6) shows that the red cluster represents areas where most buildings were built after the 1970s while the green cluster represents areas where most buildings were built before 1970. This partition is related with Portuguese political history, and more specifically with the 1974 April revolution. This is a mark in the Portuguese history with major effects also in urban planning and thus in real estate development. A closer examination of the E1980 component plane (variable that represents the 1970s) shows high values for the two clusters, with a clear distinction between them. These two groups represent the buildings built in the first part of the decade (between 1970 and 1974) and after 1974 until 1980.

Fig. 6. Component planes produced from the SOM

Fig. 7 shows the geographic distribution of these two clusters. It is possible to conclude that after 1974 the areas with greater growth are generally located on the periphery of the oldest area of Lisbon and in some areas of the south margin of the Tagus river.


Fig. 7. Map with the regions with predominant buildings before 1974 (in green) and after 1974 (red)

However, the two main clusters formed are not homogeneous and several partitions are possible inside each one. We will now focus on the cluster of the oldest buildings.

Selecting, in the E1919 component plane (buildings before 1919), the units with higher weights it is possible (due to the dynamically linked windows) to get the correspondent cluster on the U-matrix.

Fig. 8. E1919 component plane and U-matrix showing the SOM units (in red) where buildings before 1919 are dominant

Fig. 9 shows the geographical distribution of this cluster. It is clear that the EDs forming this cluster are spatially apart in Lisbon’s area ranging from the old town to areas in the periphery, which are the historical centres of villages and palaces around Lisbon.


Fig. 9. Map showing the EDs with predominant buildings built before 1919

However, depending on the final objective, the creation of spatially homogeneous clusters can be an important goal in the cluster analysis. The clusters are thus formed not only by the time dimension (age of the buildings) but also by their spatial location. In this particular case, we can assume that the goal is to find spatially compact clusters with historical value for tourism infrastructures. For this purpose, we used GeoSOM. A 20x10 GeoSOM was trained and the resulting U-matrix and E1910 component plane are shown in Fig. 10. Again, by selecting, on the component plane, the units with higher values, the correspondent selection will be shown on the U-matrix and on the geographic map.


Fig. 10. GeoSOM E1919 component plane and U-matrix and geographic map with old

buildings cluster selection In this case, and as expected, the cluster is spatially continuous (corresponding to the old town region). Comparing the SOM and GeoSOM results it is possible to verify that the SOM created a cluster with old building EDs, even if they are geographically distant from each other. On the other hand, GeoSOM reframed the clustering perspective, by adding the space restriction. This produces a cluster with old building EDs close to each other.

It is also possible to verify by comparing both U-matrices (Fig. 11) that due to the space restriction, GeoSOM formed a cluster less homogeneous (in terms of non-spatial variables) than the cluster produced by SOM. This is the price to pay for including spatial reasoning in the analysis.

Fig. 11. Comparison of the SOM and GeoSOM U-matrices with oldest buildings cluster

selection


From this case study, we may draw some simple conclusions:

1. A major partition is detected with buildings built before and after 1974 2. When analysing in greater detail these main partitions, the SOM detects a

well defined cluster for EDs with predominance of buildings older than 1919. This cluster is not spatially homogeneous, presenting EDs on several regions of the Lisbon Metropolitan Area.

3. With the use of GeoSOM a spatial homogeneous cluster of ED with old buildings emerges, corresponding to the old town of Lisbon.

5 Conclusion

In this paper we presented the GeoSOM suite, a new and efficient tool for exploratory spatial data analysis (ESDA) and clustering. This tool is based on two major methods, the SOM [13] and the GeoSOM [17]. The SOM is a well-known algorithm that has proved to be of interest in spatial clustering. The GeoSOM, by explicitly embodying spatial autocorrelation, has the capability of detecting spatially homogeneous and heterogeneous areas. These heterogeneous areas are regions where, although spatial attributes are related (data points are close to each other) non-spatial attributes present small levels of correlation.

Several visualization features were implemented allowing an improved understanding of both methods and results, such as U-matrices, component planes and parallel coordinate plots. All these tools were implemented with dynamic links that allow an interactive exploration of both the SOM and GeoSOM clustering solutions, and more importantly the interaction between these two techniques.

A case study using data concerning the age of buildings in the Lisbon Metropolitan Area was presented, showing that the GeoSOM suite can be considered a valuable addition to the field of spatial analysis and in particular to spatial clustering in the context of multidimensional spaces.

Future research will focus on evaluating the performance of GeoSOM with factors like scale, zoning and time. Also, some research will be done on using GeoSOM to detect cluster in different thematic areas and to produce final clusters based on these clusters.

References

1. Han, J., Data Mining: Concepts and Techniques. 2005: Morgan Kaufmann Publishers Inc. 2. Jain, A.K., M.N. Murty, and P.J. Flynn, Data clustering: a review. ACM Comput. Surv.,

1999. 31(3): p. 264-323. 3. Kaufman, L.R., Peter J., Finding groups in data. an introduction to cluster analysis, ed.

A.P.a. Statistics. 1990, New York: Wiley Series in Probability and Mathematical Statistics. 4. Berkhin, P., A Survey of Clustering Data Mining Techniques, in Grouping

Multidimensional Data. 2006. p. 25-71.


5. Guha, S., R. Rastogi, and K. Shim, CURE: an efficient clustering algorithm for large databases, in Proceedings of the 1998 ACM SIGMOD international conference on Management of data. 1998, ACM: Seattle, Washington, United States.

6. Ng, R.T. and H. Jiawei, CLARANS: a method for clustering objects for spatial data mining. Knowledge and Data Engineering, IEEE Transactions on, 2002. 14(5): p. 1003-1016.

7. Hu, T. and S. Sung, Clustering spatial data with a hybrid EM approach. Pattern Analysis & Applications, 2005. 8(1): p. 139-148.

8. Sander, J., M. Ester, H.-P. Kriegel, and X. Xu, Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery, 1998. 2(2): p. 169-194.

9. Zhou, A., S. Zhou, J. Cao, Y. Fan, and Y. Hu, Approaches for scaling DBSCAN algorithm to large spatial databases. J. Comput. Sci. Technol., 2000. 15(6): p. 509-526.

10. Moon, T.K., The expectation-maximization algorithm. Ieee Signal Processing Magazine, 1996. 13(6): p. 47-60.

11. Dempster, A.P., N.M. Laird, and D.B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1977. 39(1): p. 1-38.

12. Sheikholeslami, G., S. Chatterjee, and A. Zhang, WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases, in Proceedings of the 24rd International Conference on Very Large Data Bases. 1998, Morgan Kaufmann Publishers Inc.

13. Kohonen, T., Self-Organizing Maps. 3rd edition ed. 2001, Berlin: Springer. 14. Ultsch, A., Data mining and knowledge discovery with emergent self-organizing feature

maps for multivariate time series, in Kohonen maps, E. Oja and S. Kaski, Editors. 1999, Elsevier Science. p. 33-46.

15. Tobler, W.R., A Computer Movie Simulating Urban Growth in the Detroit Region. Economic Geography, 1970. 46: p. 234-240.

16. Anselin, L., Spatial Econometrics: Methods and Models (Studies in Operational Regional Science). 1988: Springer.

17. Bação, F., V. Lobo, and M. Painho, Applications of Different Self-Organizing Map Variants to Geographical Information Science Problems, in Self-Organising Maps: Applications in Geographic Information Science, P. Agarwal and A. Skupin, Editors. 2008. p. 214 pages.

18. Kohonen, T., Self-organizing formation of topologically correct feature maps. RecMap: rectangular map approximations, 1982. 43 (1): p. 59-69.

19. Ultsch, A. and H.P. Siemon. Kohonen´s self-organizing neural networks for exploratory data analysis. in Proceedings of the International Neural Network Conference. 1990. Paris: Kluwer.

20. Bação, F., V. Lobo, and M. Painho, The self-organizing map, the Geo-SOM and relevant variants for geosciences, in Computers and Geosciences. 2005, Elsevier. p. 155-163.

21. Vesanto, J., J. Himberg, E. Alhoniemi, and Parhankangas. Self-organizing map in Matlab: the SOM Toolbox. in Proceedings of the Matlab DSP Conference. 1999. Espoo, Finland: Comsol Oy.

22. Inselberg, A., The plane with parallel coordinates. The Visual Computer, 1985. 1(2): p. 69-91.

GeoSOM suite: a tool for spatial clustering · 4 Roberto Henriques1, Fernando Bação1, Victor...

Documents

Transcript of GeoSOM suite: a tool for spatial clustering · 4 Roberto Henriques1, Fernando Bação1, Victor...