
Clustering neural data

Merijn Mestdagh

Thesis submitted in fulfilment of the requirements for the degree of Master of Science in de ingenieurswetenschappen: wiskundige ingenieurstechnieken (Mathematical Engineering)

Supervisors:
Prof. dr. ir. B. De Moor
Prof. dr. E. Yaksi

Assessors:
Prof. dr. ir. K. Meerbergen
Prof. dr. ir. W. Michiels

Mentors:
Dr. ir. O.M. Agudelo
Ir. P. Dreesen
Ir. N. Verbeeck

Academic year 2012 – 2013


© Copyright KU Leuven

Without written permission of the thesis supervisors and the author it is forbidden to reproduce or adapt in any form or by any means any part of this publication. Requests for obtaining the right to reproduce or utilize parts of this publication should be addressed to the Departement Computerwetenschappen, Celestijnenlaan 200A bus 2402, B-3001 Heverlee, +32-16-327700 or by email [email protected].

A written permission of the thesis supervisors is also required to use the methods, products, schematics and programs described in this work for industrial or commercial use, and for submitting this publication in scientific contests.



Preface

This has been a very interesting year in which I was able to learn a lot. For this experience I would like to thank some people. First of all I would like to thank my promotor prof. dr. Emre Yaksi for giving me the opportunity to work with such interesting data and for the insights he gave me into the contemporary research at IMEC and NERF. I would also like to thank ir. Carmen Diaz Verdugo for producing the data and for her interesting comments at the meetings we had together.

Second, I would like to thank my other promotor, prof. dr. ir. Bart De Moor, for making this subject available.

Third, I would like to thank dr. ir. Oscar Mauricio Agudelo and ir. Philippe Dreesen for correcting my thesis and giving me great feedback. I would especially like to thank Mauricio for his excellent help throughout the year. He was always positive and very motivating. The meetings with him were very helpful: he gave me the opportunity to discuss all the details with him, taking all the time I needed. Philippe also helped me a lot, certainly at the end, where he gave me a lot of good writing tips.

I would also like to thank my assessors prof. dr. ir. Karl Meerbergen and prof. dr. ir. Wim Michiels and my third mentor ir. Nico Verbeeck for reading and commenting on my thesis.

Of course I would also like to thank my parents for their input and help throughout the year (and my whole life). They have always supported me.

Last but not least I would like to thank my friends and roommates Jeroen Aerts, ir. Jan Agten, Arne Herman, ir. Jelle Hoedemaekers and Jonas Steel for their excellent cooking throughout the year and for their tolerance of my night work, sometimes including loud music.

Merijn Mestdagh


Contents

Preface
Abstract
List of Abbreviations and Symbols
1 Introduction
   1.1 The raw data
   1.2 Clustering
   1.3 Clustering time series
2 Literature Review
   2.1 Similar literature
   2.2 Methods used for human neural data, next to clustering
   2.3 Clustering methods used for human neural data
   2.4 Conclusion
3 Methods
   3.1 Similarity measures
   3.2 Preliminary analysis
   3.3 K-means
   3.4 Hierarchical clustering
   3.5 Spectral clustering
   3.6 Fuzzy c-means clustering
   3.7 Neural gas algorithm
   3.8 Independent Component Analysis
   3.9 Spatial Coefficient
4 Results
   4.1 The external validation
   4.2 Comparison of the algorithms
   4.3 Conclusion
5 Conclusion
   5.1 The algorithms
   5.2 Distance measure
   5.3 Future work
A Paper
Bibliography


Abstract

In this thesis, the unsupervised learning of a new kind of data is discussed. This data is created with recently developed techniques with which single neural activity can be measured at a high spatial and temporal resolution. There is already a history of clustering brain data time series, but such methods have never been used on this new single-cell resolution data. Previous research also lacks a good comparative study for the clustering of similar time series, one that compares distance measures as well as algorithms and includes contemporary knowledge about clustering.

In this study, an attempt is made to find an optimal solution for the clustering of this data. First the k-means algorithm is used as a baseline algorithm to compare against. Many variations of this k-means algorithm are tested. The variations include two different distance measures, correlation and Euclidean distance, and different preprocessing techniques like filtering, normalization and outlier detection. This k-means algorithm is also compared with other clustering algorithms: the fuzzy c-means algorithm, hierarchical clustering, the neural gas algorithm and the more contemporary spectral clustering. Within these algorithms as well, different preprocessing techniques or distance measures are tried out. Because independent component analysis techniques are used a lot on this kind of data, clusterings achieved with such techniques are also discussed and compared. For every algorithm, all the parameters are carefully tuned. For all the clustering algorithms, a suitable internal validity measure is used to decide on the optimal number of clusters.

The algorithms and variations were compared on three aspects. First, their performance in previous literature is assessed; second, their computational cost is measured; and third, their performance is measured based on an external validation measure. It is expected that neurons which are clustered together by their time series are also located close to each other in the brain. An advantage of the single-cell imaging techniques is that one knows exactly where each neuron is located. Using this information, a new coefficient is proposed to measure spatial connectivity.

The results showed that the choice of the distance measure was far more important than the choice of the algorithm. The different algorithms produced nowhere near as much variation in the results as the different preprocessing techniques or distance measures. The best variations according to the spatial connectivity were the spectral clustering algorithm, the k-means clustering algorithm and the hierarchical clustering algorithm, all using the correlation as distance measure. There is however a larger difference between the algorithms in terms of computational cost. The spectral clustering algorithm can be implemented with sparse matrices, which provides a fast algorithm with a low memory cost. This makes it the algorithm of choice, followed by the hierarchical clustering and the k-means algorithm. However, also in the assessment of the computational cost, the distance measure is important. Variations that use the correlation measure are faster than the others.


List of Abbreviations and Symbols

Abbreviations

AR     Autoregressive
ARI    Adjusted Rand Index
fMRI   Functional Magnetic Resonance Imaging
FN     False Negative
FP     False Positive
HCN    High Correlation Neurons
ICA    Independent Component Analysis
LCN    Low Correlation Neurons
PET    Positron Emission Tomography
SC     Spatial Coefficient
SICA   Spatial Independent Component Analysis
TICA   Temporal Independent Component Analysis
TN     True Negative
TP     True Positive


Abbreviations of algorithms

KMC     K-means with correlation
KME     K-means with Euclidean distances
KMFBP   K-means, band-pass filtered (with correlation)
KMFLP   K-means, low-pass filtered (with correlation)
AR      K-means on the autoregression parameters
KMHCN   K-means on the high correlation neurons (with correlation)
KMN     K-means on the normalized data (with Euclidean distances)
TICA    Temporal independent component analysis
SICA    Spatial independent component analysis
FUZN    Fuzzy c-means on the normalized data (with Euclidean distances)
FUZC    Fuzzy c-means with correlation
SRBFN   Spectral clustering with RBF similarity on the normalized data
SRBF    Spectral clustering with RBF similarity on the unnormalized data
SNN     Spectral clustering with nearest neighbors
SMNN    Spectral clustering with mutual nearest neighbors
HIER    Hierarchical clustering
NG      Neural gas algorithm

Symbols

~x_i              A vector or a time series
x̄                 The mean of time series ~x
~c_i              Cluster i
~s_i              Cluster center of cluster i
cor(~x_1, ~x_2)   Correlation between two time series
d(~x_1, ~x_2)     Distance between two time series
P()               Probability
s(~x_1, ~x_2)     Similarity between two time series
s                 Seconds
I                 Identity matrix
K                 Number of clusters
N                 Number of neurons
Nn                Number of nearest neurons
T                 The length of a time series


Chapter 1

Introduction

Unsupervised exploratory data analysis has been conducted many times on positron emission tomography (PET) data and functional magnetic resonance imaging (fMRI) data. These are two well known techniques that have been used to study brain activity since 1950 and 1990, respectively. In this thesis, clustering methods are tested on a new kind of neural data. The main goal of this thesis is to compare these clustering algorithms and to propose the most suitable solutions.

1.1 The raw data

The data is collected from the full forebrain of the zebrafish using new methods described by Ahrens et al. [1] and Panier et al. [55]. With these new functional imaging techniques, measurements can be made at single-cell resolution. The data is recorded from living zebrafish that are undergoing multiple food odor stimuli [73]. 965 different neurons are measured with a temporal resolution of 2 Hz. It is expected that the brain of the zebrafish will react to the stimuli, and therefore the time series of the single neurons will differ between the stimulus and the non-stimulus periods. It is very probable that the clustering of the brain is different in those two parts, so it is better to study the clustering of the two parts separately. The total length of the measurement is 2204 time steps or 1102 seconds, and the stimuli are roughly applied at time steps 714, 811, 911, 1009, 1109, 1207, 1306 and 1404. To be able to compare the stimulus and the non-stimulus part, the data set is separated into two parts of the same length: the non-stimulus part from time step 1 to 650 and the stimulus part from time step 651 to 1300. Other parts are dismissed for simplicity. The non-stimulus data recorded after the stimuli gave clustering results very similar to those of the non-stimulus part before the stimuli. The raw data can be seen in Figure 1.1; a lighter or more red color means that there is stronger activity. Already in the raw data the influence of the stimuli can clearly be seen.
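A minimal sketch of this split, assuming the recordings are stored in a MATLAB matrix X of size 965 x 2204 (neurons by time steps); the variable names and layout are assumptions, not taken from the thesis code.

```matlab
% Minimal sketch: split the recordings into a non-stimulus and a stimulus part.
X = rand(965, 2204);               % placeholder for the real recordings (neurons x time steps)

nonStimulusPart = X(:, 1:650);     % non-stimulus part: time steps 1 to 650
stimulusPart    = X(:, 651:1300);  % stimulus part: time steps 651 to 1300
% time steps 1301 to 2204 are dismissed for simplicity, as described in the text
```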

1.2 Clustering

To get valuable information out of the raw data produced by the zebrafish, a profound data analysis has to be done. There exist two broad ways of analysis. The first one, the confirmatory analysis, tries to confirm the validity of a hypothesis or of a model. The second, called exploratory data analysis, explores the data and tries to find new information [37]. This exploratory data analysis, also known as learning, can be split again. First, one has supervised learning. In such an analysis one has two sets of data, x and y, and one tries to find a link between them: y = f(x). On the other hand, in unsupervised learning one has no such labels or outputs as y. The main goal is to find a hidden structure in the data. However, one can never be sure whether the resulting structure has any valuable meaning because there is no ground truth [3]. Between those two kinds of learning, semi-supervised learning also exists. Here only a small amount of the data x has labels y, but the largest part has not.

Unsupervised learning, also known as clustering, tries to find groups in data. The objective is to assign the objects in the data to groups such that the similarities between objects of the same group are high while the similarities between objects of different groups are low. Humans are actually quite good at clustering, but when the information becomes multidimensional, algorithms have to be designed to handle this task.

Figure 1.1 (heat map of the raw data, neurons versus time steps): The time series of all the individual neurons. The data is scaled so that the maximum is 100 and the minimum is 0. The black lines represent where the data is cut.

1.3 Clustering time series

The data from the zebrafish are time series. Liao describes that choosing a good clustering for time series involves three important choices [41]. One needs to make a decision about the similarity measure, the algorithm and the cluster evaluation criteria. It is important that especially the similarity measure and the algorithm get enough attention. In the third chapter, about the methods, the theory and implementation details concerning these three choices will be explained. In the second chapter similar previous work will be discussed. The algorithms are compared and evaluated in the fourth chapter. In the last chapter, a conclusion will be given and a decision will be made.


Chapter 2

Literature Review

2.1 Similar literature

Clustering of this new kind of neural data from the zebrafish, with high spatial and temporal resolution, has, to the best of the author's knowledge, never been done before. On the other hand, there has been far too much extensive research about clustering and its applications to be covered in this literature review. However, there has also been a lot of more specific research on dividing the brain into groups with different specific functions. This is mostly done with human data, using positron emission tomography (PET) or functional magnetic resonance imaging (fMRI). Because the data from the brain of a zebrafish is quite similar to the data of the human brain, one can expect that methods that work on human data also work on zebrafish data. In this chapter these methods will be discussed.

2.2 Methods used for human neural data, next to clustering

Before clustering techniques were applied to map the brain, other techniques with similar objectives were used. Some of these methods are still used today besides clustering analysis.

2.2.1 Correlation analysis

Bandettini et al. [4] describe an early but commonly used way to analyse fMRI data. The time course of each voxel is correlated with a reference signal. When a voxel is important in the task the person is performing, the voxel will be correlated with the reference signal. Following this procedure, the parts in the brain that are important for the task can be identified. The reference signal is built in association with the task the person is performing. This is of course not an exploratory data analysis. A reference signal can be made for a person repetitively tapping with his finger. But, as Somorjai et al. [65] note, 'a priori modelling is impossible in principle when the stimulus to be identified is not extraneously created, spontaneous and non-generic (e.g., the task is to follow the behaviour of a patient with Tourette's syndrome, or the onset and course of epileptic seizures, or the consequences of drug therapy, etc.)'. For this data with well-defined odor stimuli, a reference signal can be made. However, with this method one can only split the data in two parts: the part that follows the reference signal and the part that does not. Additionally, the methods discussed in this thesis should also be able to cluster data generated by zebrafish which suffer from epileptic seizures. For such seizures this correlation analysis would become totally useless.

2.2.2 PCA and ICA

The first time exploratory data analysis was performed on data produced by a brain, principal component analysis (PCA) was used. This method is used to extract the functional patterns out of the data. PCA was first used on a PET data set by Moeller et al. [51] and later on fMRI data by Sychra et al. [69]. PCA transforms the data into scores on a number of orthogonal principal components. This transformation is done so that the first component explains as much as possible of the variability of the data. The second component is orthogonal to the first component but still tries to explain as much of the variability as possible, and so on. The problem is that most of the variance might not be explained by the task-related process or other interesting processes [47]. Even more, a lot of variance is explained by instrumental or physiological noise [65].

Independent component analysis (ICA) tries to find the original sources from a mixture of sources. The analysis does not work with orthogonal components but assumes that the underlying sources are independent. It also does not try to explain as much variance as possible, as PCA does, and is therefore seen as a better option for this PET or fMRI data [12, 47, 65]. Calhoun et al. [13] give a summary of the numerous applications using ICA in combination with fMRI data. They divide ICA into two kinds: temporal ICA (TICA) and spatial ICA (SICA). McKeown et al. [47], for example, use SICA. They decompose the fMRI data into independent component maps. The method is illustrated in Figure 2.1. Biswal et al. [12] use TICA: they differentiate the nature of the different sources using ICA and later make a map of the brain using correlation analysis. Liu et al. [42] comment that although ICA is used more and more in resting-state fMRI analysis ([39, 70, 12]), there is still no empirical evidence for the assumption of independent sources made by ICA. Also Somorjai et al. argue that the assumption of spatial or temporal statistical independence is a limitation.


Figure 2.1: The fMRI data is reconstructed as a mixture of spatial independent components (component maps). The relative contribution at each time step is defined by the matrix M [47].

2.3 Clustering methods used for human neural data

2.3.1 Fuzzy c-means

Baumgartner et al. [5] were in 1998 among the first to use clustering as an exploratory data analysis on fMRI data. They argued that, contrary to the correlation analysis, this method is unbiased: 'It identifies the actual rather than the expected responses'. The researchers tried, and succeeded, to answer the question of where something happened and what its temporal characteristics were. They used a fuzzy c-means clustering algorithm for this purpose. They used the Euclidean distance as distance measure and argued that this was a good option because it is also able to differentiate levels of activation. First they made an initial clustering and then reclustered the interesting parts. It should be noted that the time series had a length T of only 35 points.

One year later Baumgartner et al. [6] compared this fuzzy c-means clustering algorithm to the correlation analysis. They concluded that the actual hemodynamic response function, which is sometimes hard to create, is not needed. The results from the fuzzy c-means algorithm, which does not need any prior information, were quite comparable to the results of the correlation analysis.

Also in 1998, Golay et al. [25] compared the fuzzy c-means with the traditional correlation analysis. First they compared the membership function from the algorithm with the probability that a time series correlated with a cluster center; this probability was computed from a Z-score, which follows a Gaussian distribution. To define whether a time series belonged to a cluster, they thresholded the probability and the membership function. The result, the true positive fraction, was better with the Z-score than with the membership function. This difference was however not significant. In addition to the Euclidean distance, they also used two correlation-based methods:

d_1^2 = \left( \frac{1 - cor(\vec{x}_i, \vec{s}_j)}{1 + cor(\vec{x}_i, \vec{s}_j)} \right)^{\beta}    (2.1)

and

d_2^2 = 2 \left( 1 - cor(\vec{x}_i, \vec{s}_j) \right)    (2.2)

They found that the distance d_1^2 worked most consistently and that the Euclidean distance underperformed. Distance measure d_2^2 also had quite good results and does not require estimating the extra parameter β.

2.3.2 Divisive hierarchical clustering

Filzmoser et al. [22] first proposed a hierarchical clustering algorithm for the clustering of fMRI data in 1999. They argued that it is impossible to construct a similarity matrix for all the time series because this matrix grows quadratically. Instead they used a top-down approach. They always split the data into two parts using k-means clustering. After that they analysed the new parts, and so on. They stopped the clustering when they thought that there was no more structure left in the data. They proposed various ways to decide this (e.g., a visual inspection of the PCA). Subsequently they performed a merging of the clusters that were too similar. Finally they used the final cluster centers as starting points for a k-means clustering of the whole data set. This way they had the advantages of k-means clustering, namely a fast clustering algorithm, together with a way of finding good initial cluster centers and the number of clusters. These days this method is not necessary any more; the computers of 2013 provide enough memory to store the similarity matrix. Even when this is not the case, special data-mining methods such as BIRCH can be applied [74].

2.3.3 Feature extraction

Goutte et al. [27, 26] used features instead of raw time series to cluster. They used the results of different kinds of tests performed by a correlation analysis to fill the feature vector. This way Goutte et al. were able to perform a kind of meta-analysis of correlation-based methods. Because one of the reasons clustering is proposed in this thesis is the search for an unbiased method that does not need prior knowledge, this method from Goutte et al. does not seem to have any benefits. Goutte et al. themselves noticed the need for prior knowledge. They argued that there was a trade-off: one can either work with rich features and prior knowledge, or with assumption-free features. Such assumption-free features could consist of the results of a wavelet transform. The disadvantage of the use of raw data series could be the dimensionality. When the time series become too long, the distances between time series could become meaningless. Goutte et al. compare different information criteria (e.g., the Akaike information criterion) to decide the number of clusters. However, a good validation measure for feature vectors is not necessarily a good one for time series.

2.3.4 Fuzzy c-means versus k-means clustering

A lot of the research papers on clustering fMRI data use the k-means clustering algorithm, but maybe even more use the fuzzy c-means clustering algorithm. The advantage of fuzzy c-means is that it is less prone to converge to a local minimum too early. One can also argue that it is not so appropriate to use a hard division in biological systems. It might be biologically incorrect to argue that the brain consists of totally distinct parts [72]. The advantage of k-means is that it has one parameter less to estimate. In our case, where we have no reference data or methods that do exactly the same, this could be a priority. Additionally, most fuzzy c-means results are thresholded, and in this way they are actually used as k-means results [7]. On top of that, a k-means clustering with different clusters can be visualized more easily in one figure.

2.3.5 EVIDENT framework

Somorjai et al. [65] proposed a whole framework to handle the clustering. They used a fast fuzzy c-means method which includes a lot of preprocessing. First they normalized the data. After this they did a preselection to exclude less interesting time series. Time series with a significant trend were excluded from the clustering, because such a trend may be due to motion artefacts or instrumental drifts. Note that they also looked at non-linear trends. With autocorrelation they also tested whether the time series consisted of mainly noise or not. The time series with a too low autocorrelation were excluded. They excluded a lot of time series so they could have a faster algorithm. Speed seemed to be very important in their framework.

2.3.6 Meta-analysis

In 2004 a comparison between the cluster analyses of fMRI data was published by Dimitriadou et al. [19]. They tried to compare the methods that were mostly used at that time. To test the algorithms they used two performance coefficients. First they calculated the correlation between the center of the activation class and the center of the reference activation class. They also calculated a weighted Jaccard coefficient (wJC). This measurement 'should provide a quantitative measure of the quality of the activation cluster' [19]. It is computed as follows:

wJC = \frac{TP + 1/P(TP)}{TP + 1/P(TP) + FP + 1/P(FP) + FN + 1/P(FN)}    (2.3)

with TP the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives. The function P denotes the probability. For the non-hierarchical methods the neural gas algorithm performed best, closely followed by the k-means algorithm. The fuzzy c-means algorithm showed not to be a good option. The hierarchical algorithm with Ward linkage had a similar or even better performance. Because this algorithm was computationally heavy, Dimitriadou et al. preferred the neural gas algorithm. However, this meta-analysis might be insufficient to draw conclusions. They only used one kind of distance measure (the Euclidean distance). As shown in [25], the fuzzy c-means algorithm performed better when combined with correlation measures. In modern times, the computational argument against the hierarchical methods is not that important any more.

2.3.7 Comparison of clustering and ICA

In the same year, Meyer-Baese et al. [48] also performed a kind of meta-analysis. They compared three ICA methods with three clustering algorithms. They compared the task-related activation maps with associated time courses and receiver operating characteristics. The biggest advantage of ICA was that it was a faster algorithm. Apart from that, the clustering algorithms mostly performed better. In particular, the neural gas algorithm again gave very good results.

2.3.8 Hierarchical clustering in resting state with single linkage

Cordes et al. [17] used hierarchical clustering to analyse resting-state fMRI data. In particular, the low frequencies are interesting for researching connectivity in fMRI resting-state data. The method they used is an agglomerative hierarchical clustering algorithm with single linkage. First they computed the correlation between all the time series. Then the time series that had barely any correlation with any other time series were excluded. This was done to make the algorithm computationally more tractable. Van De Ven et al. [72] argued that this could bias the analysis. For the remaining time series, they used a special correlation that measures the correlation at low frequencies to construct a similarity matrix. They used hierarchical clustering with single-link distances. To define the number of clusters, they stopped the merging based on a consistency measure.

2.3.9 Short-time Fourier transform

Another way of looking at the low frequencies is to simply low-pass filter the time series. Mezer et al. [49] used this and transformed the time series using a short-time Fourier transform. This transformation gives information about the frequencies as well as about the time when these frequencies occur. Finally they used the k-means algorithm to cluster the data.

2.3.10 Including spatial information

An interesting way to use the spatial information from the time series was proposed by Chuang et al. [16]. They added information about the neighboring time series into the membership function. This way the regions will be more homogeneous and less noisy. Because the purpose of this thesis is to research a quite new kind of data, making such spatial assumptions would not be appropriate.

2.3.11 Clustering of resting-state fMRI compared with the aggregation index

Liu et al. [42] also used clustering to analyse resting-state fMRI data and to find resting-state networks. Because of the resting-state analysis, the data was band-pass filtered. They computed the Pearson correlation between all the time series and transformed these similarities into a distance measure with d(~x, ~y) = 1 − cor(~x, ~y). To compute the distance between clusters, average linkage was used. When this distance exceeded a predefined threshold, the data was not merged further. Also, clusters with fewer than eight time series were excluded because it was found unlikely that they would represent meaningful spatial patterns. These two parameters were estimated such that the algorithm would stay stable when the parameters changed a bit. Additionally they also used the same consistency measure as Cordes et al. [17]. Liu et al. used an interesting way to compare and evaluate their results. They used the aggregation index from He et al. [32]. This is given by the number of shared edges divided by the maximum number of shared edges, as illustrated in Figure 2.2. It is believed that clusterings with a higher aggregation index are more meaningful than clusterings with spatially randomly distributed time series. This is of course not an accurate measure. The results of their hierarchical method were compared with the results from an ICA analysis. The hierarchical method had significantly higher aggregation index values.

Figure 2.2: 'The left picture has an aggregation index of 1. The right picture has an aggregation index of 16/24 = 0.67' [42].

2.4 Conclusion

It should be clear that normal clustering should be preferred over PCA or ICA analyses. In normal fMRI data analysis as well as in resting-state fMRI data analysis, ICA and PCA were outperformed by the clustering algorithms [42, 48]. The neural gas algorithm with Euclidean distance performed quite well in the various meta-analyses [19, 48], but this is a rather unknown algorithm. It could be hard to combine it with possible extensions that can be used to research time-varying clusters (e.g., evolutionary clustering [14]).

Another algorithm that performed well in a meta-analysis is hierarchical clustering [42, 19]. In the past this algorithm was neglected or simplified because it was computationally inferior to other algorithms [22], but these days it should be possible to implement and use it correctly. It is already striking that almost all the recent algorithms for resting-state fMRI data analysis use hierarchical clustering [17]. This algorithm should be used in combination with Ward's linkage or average linkage.

All these clustering algorithms (ICA, k-means, fuzzy c-means, hierarchical clustering and neural gas clustering) will be tested on the zebrafish data. There exist of course many more clustering algorithms. However, it is hard enough to compare all the algorithms already used in similar problems; for other methods, therefore, no useful comments can be made. Nevertheless, a lot of these articles are more than ten years old and do not make use of contemporary clustering techniques. Therefore, a more recent algorithm, spectral clustering [44], will also be added to the comparison.

Concerning the distance measure, it can be noted that hierarchical clustering performed well with the correlation similarity as well as with the Euclidean distance [25, 42]. This was certainly not the case for fuzzy c-means [25]. It is interesting to note that only a few papers addressed this quite important question. The choice between the correlation and the Euclidean distance also depends on the importance of the amplitudes of the time series. In this thesis, multiple distance measures will be used and compared.

It is also interesting to note that it is not uncommon to preprocess the data in one way or another. Sometimes the data is normalized, and even more often outliers are separated out [65]. Outliers can be whole clusters with a low number of members, time series with a low autocorrelation or time series that have no high correlation with other time series [42]. This was done to reduce the noise in the data and to make the algorithm faster. However, one should always remember that valuable information can be lost while using these methods. In the following chapters, those techniques will also be tested and compared.

In the literature, no internal cluster validity measures were used to define the number of clusters when partitioning clustering algorithms were applied to fMRI time series. As far as the author knows, there exists no survey at all about validity measures used for clustering time series. The best option in this case is to use multiple validity measures and to compare them. For hierarchical clustering, the method of merging clusters until a parameter exceeds a certain threshold is mostly used. It is difficult to find an internal validation measure that can be used for every algorithm. Specific algorithms make use of specific measures. For every algorithm, a suitable measure will be used.

In two articles, spatial data was also included in the clustering process [42, 16]. This information can also be used in the case of the zebrafish. To assume a spatial arrangement of the clustering of the neurons and to enforce this by manipulating the clustering algorithms would, however, not be appropriate; there is not enough information to make such strong assumptions. Nevertheless, the spatial arrangement of the clusterings can be used as an external validation measure, as is done with the aggregation index. The spatial distribution of the neurons is however not suited to this aggregation index; therefore a new coefficient is proposed in the following chapter.


Chapter 3

Methods

In this chapter, the various methods will be discussed. First, different similarity measures will be considered. Those are needed for all the algorithms and can be used in the same way for each algorithm.

In the second part of this chapter there is a preliminary analysis; this is needed to estimate the difficulty of the problem and to get familiar with the data by trying to visualize it. This analysis can give guidelines for how the clustering should be handled.

Subsequently all the algorithms will be discussed. This will be done using a fixed pattern. First the theory will be discussed, second the implementation or possible variations will be explained, and third the tuning of some parameters will be tackled. The number of clusters is a typical parameter that needs to be tuned, but often values for other parameters must be decided as well. Fourth, the speed and computational needs of the algorithms will be shortly discussed, and finally some specific results from the algorithm are shown.

In the last part of this chapter the computation of an external validation measure will be explained. The spatial information will be used to create a coefficient that rates the spatial distribution of the clustered time series in the brain.

Everything is tested in MATLAB R2012a [46]. If an extra toolbox that is not standard in MATLAB is required for a certain algorithm, this will be mentioned.

3.1 Similarity measures

Clustering is about grouping together the objects (or time series) that are similar and segregating them from the objects that are distant. To achieve this, a distance measure or similarity measure is needed. Most algorithms need distance measures (e.g., k-means, hierarchical clustering, ...) but sometimes it is more natural to work with similarity measures (e.g., correlation analysis). A distance is supposed to fulfill the following properties:

1. d(~x, ~y) ≥ 0

2. d(~x, ~y) = 0 if and only if ~x = ~y

3. d(~x, ~y) = d(~y, ~x)

4. d(~x, ~y) + d(~y, ~z) ≥ d(~x, ~z)

According to T. Warren Liao [41] there are three different ways to compute a similarity or distance between time series. These will be explained now.

3.1.1 Comparison of the raw time series

Euclidean distance

A first way to compare the time series is to just use the raw time series. The most common example is the Euclidean distance. For time series ~x and ~y with length T and time points x_t the formula for this distance is given by:

d_{Euclidean}(\vec{x}, \vec{y}) = \sqrt{\sum_{t=1}^{T} (x_t - y_t)^2}    (3.1)

This distance can also be written as the L2 norm of the difference between the time series:

d_{Euclidean}(\vec{x}, \vec{y}) = \| \vec{x} - \vec{y} \|_2    (3.2)

A well known problem concerning this distance is the curse of dimensionality. As one gets more and more dimensions, it gets harder to see a difference between distances. In a single distribution, the relative difference between the maximum and minimum distance between the objects converges to zero [9]:

\lim_{d \to +\infty} \frac{d_{max} - d_{min}}{d_{min}} = 0    (3.3)

This can be seen in Figure 3.1. One can almost say that this distance is not useful anymore for high dimensions. This curse of dimensionality has of course a big influence on clustering. When all distances are almost the same, it is hard to find clusters. Another example is given in Figure 3.2. With one dimension the clustering can be performed easily. When a meaningless dimension is added, it is not so easy anymore to see two separate clusters. In real life the data is of course not random and most dimensions will have a meaning. However, there will always be meaningless noise and the problem could get more difficult if more dimensions are added.


Figure 3.1: Effect of the curse of dimensionality: (d_max − d_min)/d_min as a function of the number of dimensions (log-log plot). The points are taken randomly out of a uniform distribution.

Figure 3.2: Effect of the curse of dimensionality on clustering. On the left a data set with one meaningful dimension, and on the right a data set with a meaningful and a meaningless dimension.

Correlation

Another way to compare raw time series is to compute the Pearson correlation (with x̄ and ȳ the means of time series ~x and ~y respectively):

r(\vec{x}, \vec{y}) = \frac{\sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_{t=1}^{T} (x_t - \bar{x})^2} \sqrt{\sum_{t=1}^{T} (y_t - \bar{y})^2}}    (3.4)

It is a measure of linear dependence between the time series. If the time series are not normalized, the similarity between two time series can be big even if one of the time series has a much lower amplitude. This similarity measure does not take into account how strong the reaction of a neuron is, or how high the amplitude of the time series is. Notice that it is quite similar to the Euclidean distance measure when the time series are normalized. To normalize a time series, the following equation is applied to each time step x_t of the time series ~x:

x_{t,norm} = \frac{x_t - \bar{x}}{S_x}    (3.5)

with S_x the standard deviation of the time series. The result is that the normalized time series possesses a zero mean and a standard deviation of one. The correlation then becomes

r(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{T}    (3.6)

and the Euclidean distance becomes

d_{Euclidean}(\vec{x}, \vec{y}) = \sqrt{\sum_{t=1}^{T} (x_t - y_t)^2} = \sqrt{\sum_{t=1}^{T} (x_t^2 + y_t^2 - 2 x_t y_t)} = \sqrt{2T - 2\,\vec{x} \cdot \vec{y}} = \sqrt{2T} \sqrt{1 - r(\vec{x}, \vec{y})}    (3.7)

Mostly 1 − r(~x, ~y) is taken as the distance measure. Note however that this distance measure is no real distance measure when the time series are not normalized: two different time series can still have a zero distance.
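A short numerical check of the normalization (3.5) and of relation (3.7); the two time series below are arbitrary random signals, used only as a minimal sketch.

```matlab
% Sketch: normalization and the link between correlation and Euclidean distance.
T = 650;
x = randn(1, T); y = randn(1, T);      % two arbitrary time series
xn = (x - mean(x)) / std(x, 1);        % normalization (3.5): zero mean, unit standard deviation
yn = (y - mean(y)) / std(y, 1);        % std(., 1) divides by T, so that sum(xn.^2) = T
r  = (xn * yn') / T;                   % correlation of the normalized series, equation (3.6)
dE = norm(xn - yn);                    % Euclidean distance of the normalized series
dE_check = sqrt(2*T) * sqrt(1 - r);    % right-hand side of (3.7); dE and dE_check agree
```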

RBF kernel

Another similarity measure is the RBF kernel, also known as the Gaussian similarity function. It is given by

s_{rbf}(\vec{x}, \vec{y}) = e^{-\frac{\|\vec{x} - \vec{y}\|}{\sigma^2}}    (3.8)

The σ is another parameter that needs to be estimated. This measure is particularly interesting for spectral clustering [44]. There one needs a measure that can be tuned in such a way that a neuron has a negligible similarity with most of the other neurons, but a high similarity with its 'nearest neighbors'.

Nn-nearest neighbors

Another way to do this is to make use of the Nn-nearest neighbors technique. In this case one first has to choose a number Nn. Afterwards, for each neuron, one sets the similarity to its Nn nearest neurons to 1 and to the rest to 0. Note that the number Nn refers here to the number of nearest neighbors and not to the number of neurons N in the data set. Of course one needs another similarity measure to define the Nn nearest neighbors.
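A sketch of the Nn-nearest neighbors similarity, here ranking neighbors by correlation; the data matrix and the value of Nn are hypothetical.

```matlab
% Sketch: Nn-nearest neighbors similarity matrix (neighbors ranked by correlation).
X  = rand(50, 650);                    % hypothetical neurons x time steps
Nn = 5;                                % number of nearest neighbors
R  = corrcoef(X');                     % correlations between all neurons
R(logical(eye(size(R)))) = -Inf;       % exclude the neuron itself
[~, order] = sort(R, 2, 'descend');    % rank the other neurons per neuron
S = zeros(size(R));
for i = 1:size(R, 1)
    S(i, order(i, 1:Nn)) = 1;          % mark the Nn nearest neighbors
end
% S is in general not symmetric; it can be symmetrized (nearest neighbors) or
% intersected with its transpose (mutual nearest neighbors).
```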


3.1.2 Clustering of processed time series

A second way is to extract features out of the time series. These features can for example include the mean, the standard deviation and the autocorrelation of the time series. After this extraction, a raw-data distance measure can be used on the feature vector (e.g., the Euclidean distance).

A last, quite similar, way is to create a model for the time series and to use the parameters of this model as the features. This can also be used to filter the data: one can for example make an autoregressive model of each time series. Then one can choose to compare the parameters of this model, or again compare the time series predicted by this model.
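A sketch of the feature-based approach with three simple features (mean, standard deviation and lag-1 autocorrelation); this feature set is an illustrative choice, not the one used later in the thesis.

```matlab
% Sketch: extract a short feature vector per time series.
X = rand(50, 650);                             % hypothetical neurons x time steps
N = size(X, 1);
features = zeros(N, 3);
for i = 1:N
    x  = X(i, :);
    xc = x - mean(x);
    features(i, 1) = mean(x);                  % mean
    features(i, 2) = std(x);                   % standard deviation
    features(i, 3) = sum(xc(1:end-1) .* xc(2:end)) / sum(xc.^2);   % lag-1 autocorrelation
end
% a raw-data distance (e.g., Euclidean) can now be applied to the rows of 'features'
```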

3.1.3 Choice of the similarity measure

The literature is slightly in favour of the correlation (see Chapter 2). Using Euclidean distances would only make the problem more difficult: in that case, not only the form of the time series but also the amplitude has to be the same for two time series to be considered similar. It will be interesting to compare the effects of these distance measures, since this has not yet been studied thoroughly. Because the literature review is in favour of the correlation, the correlation will mostly be used. Normalizing the data makes the correlation quite similar to the Euclidean distance; therefore this method is also used a lot. Of course, the Euclidean distance on the raw data, the Nn-nearest neighbors method and the RBF kernel will also be used for comparison.

3.2 Preliminary analysis

It is important to first take a look at the data, for the purpose of making the right decisions later on. An important part of this preliminary analysis will be an attempt to visualize the data. First, however, the data itself will be studied. Because the neurons will mainly be clustered on the basis of their correlation with other neurons, it is interesting to study these correlations. In Figure 3.3 a histogram of these correlations is shown. These are all the correlations between all the neurons (except the correlations of the neurons with themselves, which would always be 1). Here the first effect of the difference between the stimulus and the non-stimulus part can be illustrated. Most of the correlations are found in the area of lower correlations (−0.2 to 0.2), and in this area there are more correlations from the non-stimulus part. The stimulus part provides relatively more correlations above 0.2 and below −0.2 compared to the non-stimulus part. The stimuli ensure that the neurons are more correlated, that the time series are more alike. There are also more correlations above zero than below: for the non-stimulus part 82% are positive, compared to 64% for the stimulus part. The fact that most of the correlations are positive means that the brain is working quite synchronously. However, in the stimulus part there are obviously also some strong opposite reactions between neurons. The main remark stays however that most of the correlations are quite low and almost insignificant.


It is also interesting to look at the distribution of correlations of each individual neuron. Some neurons have barely any correlation with any other neuron, while others have a high correlation with more than 100 other neurons. This is illustrated in Figure 3.4. A correlation higher than 0.35 is called a high correlation. It is questionable whether neurons which have a high correlation with fewer than 5 other neurons can be clustered into serious clusters. The objective is not to find hundreds of mini-clusters, but rather to divide the set of neurons into several big parts. It could be interesting to only study the neurons with high correlations with enough other neurons. Therefore the high correlation neurons (HCN) are defined. These are neurons which have a high correlation with more than 20 other neurons. These can be studied because it is expected that one can find more structure in those neurons. The other neurons are called low correlation neurons (LCN). For the non-stimulus part there are 110 HCN and for the stimulus part there are 236 HCN.
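A sketch of this HCN selection with the thresholds mentioned above (correlation above 0.35 with more than 20 other neurons); the data matrix is a hypothetical placeholder.

```matlab
% Sketch: selection of the high correlation neurons (HCN).
X = rand(965, 650);                    % hypothetical neurons x time steps
R = corrcoef(X');                      % correlations between all neurons
R(logical(eye(size(R)))) = 0;          % ignore the correlation of a neuron with itself
nHigh = sum(R > 0.35, 2);              % number of high correlations per neuron
isHCN = nHigh > 20;                    % logical index of the HCN
XHCN  = X(isHCN, :);                   % data restricted to the HCN
```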

Figure 3.3: The histogram of the correlations of the neurons (number of appearances versus correlation, for the non-stimulus and the stimulus part). Most correlations are quite small, especially in the non-stimulus part.


Figure 3.4: The distribution of correlations of every neuron. For every neuron it is counted with how many other neurons it has a high correlation (> 0.35). For example, in the stimulus part there are about 100 neurons which have 50 up to 100 other neurons with which they have a high correlation.

3.2.1 Principal component analysis

It can be very informative to visualize the data in a way that the structure between the neurons is (partly) kept. Principal component analysis offers such features. PCA transforms the data into a number of scores on principal components. The more components, the more variance of the data can be explained. In this preliminary analysis, three components are kept, which is ideal for visualization in three dimensions. Every neuron or time series gets a score on each of these components, and these scores can be plotted and clustered with the k-means algorithm. Before the PCA is applied, all the time series are normalized.

The non-stimulus part already calls for the removal of the LCN. Without the removal there is absolutely no structure, and clustering results give two cluster centers that are very alike (see Figure 3.5). It is however important to note that the three components can only explain 20% of the variance. Looking only at the HCN, 55% can be explained. This is higher, partly because there are fewer neurons, but also because there is a clearer structure in these HCN. The result can be seen in Figure 3.6. The data is split in two parts: the neurons which do not have any clear time series, and the part which has a high score on some principal components and shows peaks in the time domain. The stimulus part also gives better results with only the HCN, but the difference is less drastic.
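A sketch of this preliminary PCA step, assuming the Statistics Toolbox functions zscore, princomp and kmeans available in R2012a; the exact settings of the thesis are not reproduced here.

```matlab
% Sketch: normalize, keep three principal components and cluster the scores.
X  = rand(110, 650);                            % hypothetical neurons x time steps (HCN)
Xn = zscore(X, 1, 2);                           % normalize each time series (zero mean, unit std)
[~, scores] = princomp(Xn, 'econ');             % principal component scores
scores3 = scores(:, 1:3);                       % keep three components
idx = kmeans(scores3, 2, 'Replicates', 10);     % two clusters, as in the non-stimulus part
plot3(scores3(:, 1), scores3(:, 2), scores3(:, 3), '.');   % 3D visualization of the scores
```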


Figure 3.5: On the left: the PCA scores of the non-stimulus part. On the right: the means of the time series of the two clusters. There is not really a clustering structure visible and the centers of both clusters are very similar.

Figure 3.6: On the left: the PCA scores of the non-stimulus part of the HCN. On the right: the means of the time series of the two clusters. The clustering is better compared to Figure 3.5.

3.2.2 Multidimensional scaling

Another way to visualize the data is multidimensional scaling [62]. Using this scaling one can convert a similarity matrix into new points. These points get coordinates that approximately reproduce such a similarity matrix. The correlation matrix is an obvious choice for the similarity matrix. With all the neurons, it is hard to see any structure in the data from the stimulus part. This can be seen in Figure 3.7. Again, when using only the HCN, it becomes possible to see clusters in the data. Visually it looks like there are four clusters. This is shown in Figure 3.8. When one looks at the eigenvalues returned by the multidimensional scaling, it becomes clear that 3 coordinates are not enough to accurately represent the similarity or distance matrix. However, they should give a good approximation.
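A compact MATLAB sketch of this step, assuming the pairwise correlation matrix of the neurons is stored in a variable C (an illustrative name), could be:

    % C: N x N correlation matrix of the neurons
    D  = 1 - C;                          % convert similarity into dissimilarity
    [Y, eigvals] = cmdscale(D);          % classical multidimensional scaling
    Y3  = Y(:, 1:3);                     % keep three coordinates for visualization
    idx = kmeans(Y3, 4, 'Replicates', 20);
    scatter3(Y3(:,1), Y3(:,2), Y3(:,3), 10, idx, 'filled');

The eigenvalues in eigvals indicate how accurately three coordinates can represent the full distance matrix, which is the check referred to above.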

3.2.3 Preliminary conclusion

From this preliminary analysis three things can be learned. First of all, it is clear that most of the correlations are rather low. This means that it can be hard to find any decent structure. If neurons with a correlation of 0.05 with each other have to be grouped in the same cluster, it is questionable whether one can call it a tight cluster. This could also be seen in the PCA and in the multidimensional scaling: it was hard to see any clusters when all the neurons were used. A second reason for this failure to visualize any clusters could be that three components or coordinates are not enough to represent the data appropriately. The second conclusion is therefore that it is probably a bad idea to do data compression. Last but not least, when using only the HCN it is possible to see clear clustering structures. After visual inspection one would expect two clusters for the non-stimulus part and four clusters for the stimulus part. This information was clearly clouded by the LCN.

[Figure: left panel "Multidimensional scaling coordinates of the stimulus part" (3-D scatter plot, axes X, Y, Z); right panel "Cluster centers of the stimulus part" (mean of cluster versus time steps, four cluster centers).]

Figure 3.7: On the left: the multidimensional scaling coordinates of the stimulus part. On the right: the means of the time series of the four clusters after clustering with k-means. There is not really a clustering structure visible.


[Figure: left panel "Multidimensional scaling coordinates of the HCN of the stimulus part" (3-D scatter plot, axes X, Y, Z); right panel "Cluster centers of the HCN of the stimulus part" (mean of cluster versus time steps, four cluster centers).]

Figure 3.8: On the left: the multidimensional scaling coordinates of the stimulus part of the HCN. On the right: the means of the time series of the four clusters after clustering with k-means. One can clearly see a clustering structure.


3.3 K-means

There exist many different clustering algorithms. Most of them can be divided into hierarchical and partitional clustering algorithms. In partitional clustering there is, as expected, no hierarchical structure: all clusters are created simultaneously. A well-known clustering algorithm is k-means. Steinhaus [68] was the first to describe this method in 1956, although he did not call it k-means. Despite the fact that the algorithm is more than 50 years old, it is still one of the most popular clustering algorithms [37].


3.3.1 Theory

The inputs are the T-dimensional time series and the expected number of clusters, K. The classical k-means algorithm tries to minimize the following objective function:

\[ J(C) = \sum_{k=1}^{K} \sum_{\vec{x}_i \in C_k} d(\vec{x}_i, \vec{s}_k)^2 \qquad (3.9) \]

where C_k is the set of neurons that belong to cluster k and \vec{s}_k is the center of that cluster. Unfortunately, this problem is NP-hard [20]. To find a local minimum the algorithm has the following steps:

1. Select an initial set of K cluster centers and repeat steps 2 and 3 until the algorithm converges or the number of iterations exceeds a predefined threshold.

2. Divide the objects into K clusters. Each object i is added to cluster k if $d(\vec{x}_i, \vec{s}_k) \leq d(\vec{x}_i, \vec{s}_j), \; \forall\, 1 \leq j \leq K$.

3. Compute new cluster centers: $\vec{s}_k = \frac{1}{n_k} \sum_{\vec{x}_i \in C_k} \vec{x}_i$.

The most commonly used distance is the Euclidean distance. However, $(1 - \mathrm{cor}(\vec{x}_i, \vec{x}_j))$ will also be used here. After this first phase, it is still possible that single reassignments of neurons can improve the cost function (3.9) [66]. These reassignments, phase two of the algorithm, ensure a local minimum. The time complexity of the algorithm is O(IKNT), where I is the maximum number of iterations, K is the number of clusters, N is the number of objects and T is the length of the time series. It is linear in all the important variables [3]. The algorithm also does not need to store the huge similarity matrix that the hierarchical clustering algorithms will need. Unfortunately, this algorithm also has some disadvantages. It converges to a local minimum, and when the initial cluster centers are poorly chosen the result can be bad. Of course one can compute multiple replicates with random starting centers and choose the replicate with the best result with respect to equation (3.9).
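In MATLAB, which is used for all the experiments in this work, such a clustering can be obtained with the built-in kmeans function. The sketch below (with placeholder names X and K) combines the correlation distance, random restarts and the second, online phase; with the 'correlation' option kmeans minimizes the summed (1 − correlation) distances, which should correspond to objective (3.9) with d = √(1 − cor).

    % X: neurons x time steps, K: number of clusters
    [idx, centers] = kmeans(X, K, ...
        'Distance',    'correlation', ...   % 1 - correlation as point-to-center distance
        'Replicates',  200, ...             % keep the best of 200 random initializations
        'OnlinePhase', 'on');               % phase two: individual reassignments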

Validation

Choosing the number of clusters is an important and difficult task. There exist different validation measures that are ideally suited for a hard clustering method like the k-means algorithm. A first set of interesting measures are the Dunn [21] and Dunn-like indices [54]. Let the distance between two clusters, $d(C_a, C_b)$, be given by the single link formula

\[ d(C_a, C_b) = \min\{\, d(\vec{x}, \vec{y}) \mid \vec{x} \in C_a,\; \vec{y} \in C_b \,\} \qquad (3.10) \]

and let the diameter of a cluster Ck be defined as

\[ \mathrm{dia}(C_k) = \max_{\vec{x}, \vec{y} \in C_k} d(\vec{x}, \vec{y}) \qquad (3.11) \]

Then Dunn’s index for K clusters is given by:


\[ \mathrm{Dunn}_K = \min_{i=1,\dots,K} \left\{ \min_{j=i+1,\dots,K} \left( \frac{d(C_i, C_j)}{\max_{l=1,\dots,K} \mathrm{dia}(C_l)} \right) \right\} \qquad (3.12) \]

This measure does not show any trend with respect to the number of clusters, so it is valid to simply search for the highest value (see Figure 3.9). However, it is quite sensitive to noise: outliers can have a great effect on the denominator.

Another set of indices are the Davies-Bouldin [18] and Davies-Bouldin-like indices [54]. When dis(C_i) is a measure for the dispersion of a cluster (e.g., the spread around the cluster center, the average distance of the neurons of this cluster to the center of this cluster) and

\[ R_{ij} = \frac{\mathrm{dis}(C_i) + \mathrm{dis}(C_j)}{d(C_i, C_j)} \qquad (3.13) \]

then the Davies-Bouldin index is given by

\[ DB_K = \frac{1}{K} \sum_{i=1}^{K} \max_{j=1,\dots,K,\; j \neq i} R_{ij} \qquad (3.14) \]

The number of clusters K that makes this measure minimal should be the optimal number. This can be seen in Figure 3.9.

Also Hubert's normalized Γ can be used. Hubert's normalized Γ computes the correlation or similarity between two square matrices X and Y of the same size [34]:

\[ \Gamma = \frac{\sum_{i=1}^{N} \sum_{j=i+1}^{N} (X(i,j) - \mu_X)(Y(i,j) - \mu_Y)}{M \sigma_X \sigma_Y} \qquad (3.15) \]

where

\[ \mu_X = \frac{1}{M} \sum_{i=1}^{N} \sum_{j=i+1}^{N} X(i,j) \qquad (3.16) \]

and

\[ \sigma_X = \sqrt{ \frac{1}{M} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left( X(i,j) - \mu_X \right)^2 } \qquad (3.17) \]

M is given by N(N − 1)/2. To use this measure, two matrices are created. The two matrices to compare in this case are the distance matrix D and the matrix Q, where Q(i, j) is given by the distance between the cluster centers of the clusters of \vec{x}_i and \vec{x}_j. For example, if \vec{x}_i and \vec{x}_j are clustered in the same group, Q(i, j) will be zero. The larger Hubert's normalized Γ, the better the clustering. After a while, increasing the number of clusters will no longer benefit this measure very much, which shows up as a knee in Figure 3.9.
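A small MATLAB sketch of this measure (the function name and the construction of the second matrix are illustrative, not taken from the thesis code):

    function g = hubert_gamma(X, Y)
    % Hubert's normalized gamma (equation 3.15) between two square matrices,
    % computed over the M = N(N-1)/2 upper triangular entries (i < j).
    N  = size(X, 1);
    iu = find(triu(true(N), 1));    % linear indices of the upper triangle
    g  = corr(X(iu), Y(iu));        % Pearson correlation of the paired entries
    end

For the validation described above, X would be the distance matrix D and Y the matrix Q with the distances between the cluster centers of each pair of neurons.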

A last measure is the silhouette method [60]. This method is able to give every neuron a value, and the mean of all these values gives an indication of the goodness of the clustering. When $\vec{x}_i \in C_k$, then $a(\vec{x}_i)$ is given by the average distance between $\vec{x}_i$ and the other neurons of cluster $C_k$. Also let $d(\vec{x}_i, C_j)$ be the average distance between $\vec{x}_i$ and all the neurons of $C_j$, and let

\[ b(\vec{x}_i) = \min_{j,\; \vec{x}_i \notin C_j} d(\vec{x}_i, C_j) \qquad (3.18) \]

The silhouette value of $\vec{x}_i$ is then given by

\[ \frac{b(\vec{x}_i) - a(\vec{x}_i)}{\max(a(\vec{x}_i), b(\vec{x}_i))} \qquad (3.19) \]

A value of 1 means that the neuron is perfectly clustered: very close to all the other neurons in the same cluster and distant to all those in other clusters. A value of −1 means that the neuron is probably in the wrong cluster. It is interesting to look at all those values together in a silhouette plot. However, when clusterings have to be compared, it is impossible to assess the quality visually, one plot against another; the mean value already gives a good indication. The optimal number of clusters is given by the highest value (see Figure 3.9).
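In MATLAB the mean silhouette value for a range of cluster numbers can be computed roughly as follows (X is again a placeholder data matrix; the exact settings of the thesis experiments are not reproduced here):

    meanSil = nan(1, 20);
    for K = 2:20
        idx = kmeans(X, K, 'Distance', 'correlation', 'Replicates', 200);
        s   = silhouette(X, idx, 'correlation');   % one value per neuron, equation (3.19)
        meanSil(K) = mean(s);
    end
    plot(2:20, meanSil(2:20), '-o');
    xlabel('number of clusters'); ylabel('mean silhouette value');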

[Figure: four panels showing the mean silhouette value, the Dunn index, the Davies-Bouldin index and Hubert's normalized gamma as a function of the number of clusters.]

Figure 3.9: Validation measures after a k-means clustering of a set of 5 Gaussian distributions with different means.

3.3.2 Variations

The clustering algorithm is an important choice in the clustering process, but it is not the only one. Many preprocessing steps or different choices of similarity measures can have drastic effects on the clustering. Because the k-means algorithm is the best known algorithm, a lot of variations are tested on it.


The basics: correlations To compare the other variations against, the normal k-means algorithm will be used with $\sqrt{1 - \mathrm{cor}(\vec{x}_i, \vec{x}_j)}$ as the distance measure. Note that this function gets squared in the algorithm (see equation 3.9). There will be no preprocessing and all the neurons will be used.

Euclidean distances As already explained, the distance measure is not a straightforward choice. It can be interesting to compare the results achieved with the Euclidean distance as distance measure with those obtained with correlation as distance measure.

Normalized Euclidean distances This variation is a compromise between the previous two. It has the features of the well-known k-means algorithm with Euclidean distances, but the data is normalized before the algorithm does its work. As already explained, the correlation of normalized data resembles the Euclidean distance between normalized data, so in the beginning both algorithms behave the same. However, in the algorithm with correlations the centers keep being normalized at every step. This is not the case with this algorithm, and when the centers are not normalized the Euclidean distance measure also differs from the correlation measure.

No outliers In the preliminary analysis it became clear that the clustering structure could be clouded by the LCN. In this variation the LCN are treated as outliers and removed. This way the clustering structure should be more apparent. When the clustering results are compared with each other, the outliers need to be reassigned; in that case each outlier is assigned to the closest cluster center. $\sqrt{1 - \mathrm{cor}(\vec{x}_i, \vec{x}_j)}$ is used as the distance measure.

Low-pass filtered There is also the possibility that the higher frequencies only measure noise. Therefore the data is first scaled, then low-pass filtered, and then the k-means algorithm with correlations is executed. For this filtering a fourth-order zero-phase forward Butterworth filter is used.
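A sketch of this preprocessing in MATLAB, with the 0.25 Hz cut-off chosen in the tuning section; fs (the sampling frequency of the recordings) and the other names are placeholders, and filtfilt is used here as one common way to obtain zero-phase filtering.

    fc = 0.25;                            % cut-off frequency in Hz (Section 3.3.4)
    [b, a] = butter(4, fc / (fs/2));      % fourth-order low-pass Butterworth filter
    Xn  = zscore(X, 0, 2);                % scale every time series first
    Xlp = filtfilt(b, a, Xn')';           % zero-phase filtering, applied column-wise
    idx = kmeans(Xlp, K, 'Distance', 'correlation', 'Replicates', 200);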

Autoregressive model Another possibility to filter is to create an autoregressive (AR) model in which only the most important parts are kept. After this there are two ways to continue. First, one can reconstruct the time series and run k-means with correlations on these filtered time series. This construction could resemble the previous, low-pass filtered one very much, although this depends for example on the order of the model. Another possibility is to run the k-means algorithm (with Euclidean distances) on the AR coefficients. This last option is chosen here.
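A rough MATLAB sketch of this last option; aryule is used here as one possible estimator of the AR coefficients, and the order p and the variable names are illustrative.

    p = 5;                                    % AR order, cf. the tuning in Section 3.3.4
    A = zeros(size(X, 1), p);
    for i = 1:size(X, 1)
        c = aryule(zscore(X(i, :)), p);       % returns [1, a_1, ..., a_p]
        A(i, :) = c(2:end);                   % drop the leading 1
    end
    idx = kmeans(A, K, 'Replicates', 200);    % Euclidean k-means on the coefficients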

Band-pass filtered A last filtering possibility is to apply band-pass filtering. Maybe only a small band of frequencies is interesting for the clustering, so a band-pass filter is applied. For this filtering an eighth-order zero-phase forward Butterworth filter is used.


3.3.3 Tuning

For all these variations there are a lot of free parameters that have to be tuned and choices that have to be made.

Number of replicates

A first choice is the number of runs, i.e. the number of times the algorithm is replicated with random starting centers. The k-means algorithm only converges to a local minimum, and obviously one wants to find the best local minimum (or even the global minimum). Different local minima are found when different random starting centers are used: the more replicates, the more probable it is that one finds the optimal solution. To test how many replicates are needed, the following setup is used. Because there is no ground truth and the global optimum is not known, it is impossible to assess how far the algorithm has converged to this optimum. Therefore the global optimum is estimated by a computationally very heavy method: 20000 replicates. It is impossible to use this approach for every number of clusters and every variation, but it is very likely that the same results will hold for the other options. This test is done for 2 and 6 clusters, and for the stimulus as well as for the non-stimulus part. To compare the clusterings with the estimated ground truth, the Rand index is used [57] (other possibilities are the Jaccard coefficient [36] or the Fowlkes-Mallows index [23]). There are different ways two objects can be clustered in the real versus the reference clustering. First, they can be in the same group or partition in the real as well as in the reference clustering; SS is the number of times this occurs. DD is the number of times two objects are in different clusters in both partitions. M is the number of pairs of objects (N(N − 1)/2). The Rand index is then given by

\[ R(A, B) = \frac{SS + DD}{M} \qquad (3.20) \]

When this index is corrected for chance it is called the Adjusted Rand Index (ARI) [33], which is calculated as follows. Given two clusters C_{1,i} and C_{2,j}, respectively of clusterings C_1 and C_2, the overlap n_{ij} between those two clusters is given by

\[ n_{ij} = |C_{1,i} \cap C_{2,j}| \qquad (3.21) \]

a_i is given by
\[ a_i = \sum_j n_{ij} \qquad (3.22) \]

and b_j by
\[ b_j = \sum_i n_{ij} \qquad (3.23) \]

One can then finally compute the ARI by

\[ \mathrm{ARI} = \frac{\displaystyle \sum_{ij} \binom{n_{ij}}{2} - \frac{\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}}{\binom{N}{2}}}{\displaystyle \frac{\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}}{2} - \frac{\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}}{\binom{N}{2}}} \qquad (3.24) \]


Remember that N is the total number of neurons. The ARI can take values between −1 and 1. This adjusted version will be used here.
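A straightforward MATLAB implementation of equation (3.24), given two label vectors of the same length, could look like this (the function name is illustrative):

    function ari = adjusted_rand_index(idx1, idx2)
    % Adjusted Rand Index (equation 3.24) between two clusterings,
    % each given as a vector of cluster labels.
    n  = crosstab(idx1, idx2);           % contingency table n_ij
    a  = sum(n, 2);                      % row sums a_i
    b  = sum(n, 1);                      % column sums b_j
    N  = sum(n(:));
    c2 = @(x) x .* (x - 1) / 2;          % binomial coefficient "x choose 2"
    sumNij   = sum(c2(n(:)));
    sumA     = sum(c2(a));
    sumB     = sum(c2(b));
    expected = sumA * sumB / c2(N);
    ari = (sumNij - expected) / ((sumA + sumB) / 2 - expected);
    end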

First of all, it is interesting to compare the algorithm with 1 replicate against the estimated ground truth. The algorithm is run 100 times for all the set-ups. The percentage of times that the outcome was exactly the same as the estimated ground truth is given in the following Table 3.1.

Number of clusters   Part           Percent exact runs
2                    non-stimulus   100%
2                    stimulus       70%
6                    non-stimulus   0%
6                    stimulus       6%

Table 3.1: Percent exact runs when 1 replicate is used.

There is no hard choice for the division into two clusters for the non-stimulus part: all the runs give an exact outcome. However, for 6 clusters in the non-stimulus part not a single run gives the same result. A possible explanation is that there are clearly 2 clusters, which are found every run when 2 clusters are searched for; when the algorithm looks for 6 clusters, it divides the neurons into semi-random partitions which are obviously not the same as the estimated ground truth. For the stimulus part the results are less extreme. With this information one can also make predictions. If there is a chance p of finding the exact truth, there is a chance of 1 − (1 − p)^r of finding this truth at least once after r replicates. For 10 replicates one would estimate a probability of 0.46 for the stimulus part with 6 clusters (p = 0.06, so 1 − 0.94^10 ≈ 0.46). The real results are close to these estimates and are found in the following Table 3.2.

Number of clusters   Part           Percent exact runs
2                    non-stimulus   100%
2                    stimulus       100%
6                    non-stimulus   0%
6                    stimulus       41%

Table 3.2: Percent exact runs when 10 replicates are used.

It is clear that more replicates ensure that the result equals the estimated ground truth for most of the setups; nevertheless, for the non-stimulus setup with 6 clusters the replicates do not really seem to help. However, it is not strictly necessary to have an exact result: a result that resembles the estimated ground truth would already be quite good. For the non-stimulus part, the minimum and mean Rand index with 1, 10 and 200 replicates are computed. The results are shown in the following Table 3.3.


Number of replicates   Minimum Rand index   Mean Rand index
1                      0.31                 0.61
10                     0.67                 0.83
200                    0.86                 0.92

Table 3.3: ARI with different numbers of replicates, 6 clusters and the non-stimulus part.

The Rand index keeps improving. One could say that the algorithm is quite indecisive for this setup, maybe because there are no clear clusters to find. It is questionable whether it is necessary that the algorithm finds the ground truth in such cases, because this truth is not worth that much anyway. To still approach this ground truth, 200 replicates will be used in the remainder of this work, for every variation and setup. This ensures that a good clustering is found in the case of an easy setup and that the clustering approaches the possible ground truth.

More replicates would of course be better, but would jeopardize the computational speed; taking fewer replicates would be a risk. One has to make a trade-off between certainty and cost.

Note also that the second phase, the phase of the individual reassignments, is very important here. The high percentages of exact replications with 2 clusters (Table 3.2) would drop to 0 without the second phase.

3.3.4 Other parameters

For the other parameters, internal cluster validation measures are used. There is no good comparison between these measures in the literature on clustering time series, so it is hard to choose one. First, the number of clusters of the algorithms without extra parameters is assessed; in this case all the validation measures are compared. For later use the silhouette value is chosen. This value gives the most complete evaluation of the clustering because the distance between every two neurons is assessed. It would be confusing to use all four validation measures for every small choice.

For the low-pass filter variation the cut-off frequency is chosen manually. It is expected that the peaks in the data are important for the clustering. The cut-off frequency is therefore chosen such that the peaks are kept, but that the rest is cut off as much as possible. This value was found at 0.25 Hz. In Figure 3.10 the mean silhouette values of the correlation clustering, the correlation clustering with only the HCN, the clustering with Euclidean distances, the clustering with Euclidean distances after normalization and the low-pass filtered clustering are displayed. A first observation is that the optimal number of clusters is 2 in most of the variations, in the stimulus as well as in the non-stimulus part. Only the Euclidean distance variation of the non-stimulus part asks for 3 clusters. A second observation is that the difference between 2 clusters and the other options is smaller in the stimulus part for the algorithms that use correlation; the mean silhouette value seems to be less decisive for the stimulus part. It is also interesting to note that the silhouette values are, for each variation, higher in the stimulus part than in the non-stimulus part. There seems to be more clustering structure in the stimulus part.

For the comparison between the variations one has to be careful with conclusions. At first sight the variation with the Euclidean distance seems to be the best for 2 clusters in the stimulus part; this variation uses however another distance measure and is therefore hard to compare. The silhouette value is not really meant to compare clusterings of different data. For example, when one has a single Gaussian distribution and calculates the silhouette values after a clustering into 2 clusters, it would give a bad result. However, when the sign function is applied to this Gaussian data with zero mean, the data splits up into −1 or 1, which would give an ideal silhouette value for 2 clusters. This clustering would however be a bad clustering looking at the original data. Therefore, the silhouette value cannot compare clusterings of different data, or of data that are differently preprocessed. Such preprocessing might bring more clustering structure, but that does not mean that the newly found clustering is really better.

[Figure: mean silhouette value versus number of clusters (2-20) for the non-stimulus part (left) and the stimulus part (right), each with the basic, no-outliers, Euclidean distance, low-pass filtered and normalized Euclidean variations.]

Figure 3.10: The mean silhouette values against the number of clusters. This silhouette value is clearly in favor of two clusters.


Additionally, when one looks for example at the individual silhouette values (Figure 3.11), one can see that the clustering with the Euclidean distance is not that good at all. The variation that uses correlations has positive silhouette values for almost all the neurons. The Euclidean distance variation, however, has one good cluster and one very bad cluster: the clustering has divided the data into a big good cluster and a small bad one. The negative values mean that the neurons are probably clustered wrongly.

One can however compare the basic clustering and the clustering with the HCN. Here one finds that the HCN provide a better clustering structure, which means that the LCN cloud the clustering.

The Davies-Bouldin indices are shown in Figure 3.12, the Dunn indices in Figure 3.13 and Hubert's normalized gammas in Figure 3.14. They all give similar results. The optimal number of clusters in the non-stimulus part is summarized in Table 3.4 and that of the stimulus part in Table 3.5.

[Figure: two silhouette plots for 2 clusters of the stimulus part, "Silhouette values correlation" (left) and "Silhouette values euclidean distance" (right), showing the silhouette value per neuron grouped per cluster.]

Figure 3.11: All the silhouette values for 2 clusters for the stimulus part. On the left side the basic k-means clustering with correlations is used and on the right side the k-means clustering with Euclidean distances is used. The clustering with the correlation gives a more balanced clustering with less negative values.


Validation measure   Basic   HCN   Low-pass filtered   Euclidean   Normalized Euclidean
Silhouette           2       2     2                   3           2
Davies-Bouldin       2       2     3                   19          2
Dunn                 3       2     3                   18          3
Hubert               3       3     3                   ?           ?

Table 3.4: Optimal number of clusters in the non-stimulus part. The question marks indicate that it is impossible to notice a knee in Figure 3.14.

Validation measure   Basic   HCN   Low-pass filtered   Euclidean   Normalized Euclidean
Silhouette           2       2     2                   2           2
Davies-Bouldin       2       2     2                   2           4
Dunn                 2       2     2                   12          2
Hubert               5       3     5                   ?           ?

Table 3.5: Optimal number of clusters in the stimulus part. The question marks indicate that it is impossible to notice a knee in Figure 3.14.

[Figure: Davies-Bouldin index versus number of clusters (2-20) for the non-stimulus part (left) and the stimulus part (right), for the same five variations.]

Figure 3.12: Davies-Bouldin index for different numbers of clusters. Two is mostly the optimal number of clusters.


[Figure: Dunn index versus number of clusters (2-20) for the non-stimulus part (left) and the stimulus part (right), for the same five variations.]

Figure 3.13: Dunn's index for different numbers of clusters. Three is mostly the optimal number of clusters for the non-stimulus part and two is mostly the optimal number of clusters for the stimulus part.

[Figure: Hubert's normalized gamma versus number of clusters (2-20) for the non-stimulus part (left) and the stimulus part (right), for the same five variations.]

Figure 3.14: Hubert's normalized gamma for different numbers of clusters. Three is mostly the optimal number of clusters for the non-stimulus part and five is mostly the optimal number of clusters for the stimulus part. It is impossible to see a knee for the variations that use the Euclidean distance.


Most of the validation measures give similar results. Clearly the optimal number of clusters must be somewhere between 2 and 5 for all the variations. Hubert's normalized gamma can of course never give 2 as optimal number, because it is impossible to see a knee there. This may be the reason why one cannot detect a knee at all for the clusterings that use the Euclidean distance, hence the question marks. The algorithm with Euclidean distances without normalization also behaves strangely with the other validation measures.

For the band-pass filtering the relevant frequencies also have to be decided. A band of 0.1 Hz will be chosen. The mean silhouette values are computed for several frequencies and for several numbers of clusters, as shown in Figure 3.15. The silhouette values for the lowest frequencies are clearly the highest, but this does not mean that this band gives the best clustering, because the data is differently preprocessed: it just means that the data is transformed in a way that can be clustered better. This low band would also just be a low-pass filter, which already is a variation. The band between 0.1 Hz and 0.2 Hz however still gives quite good silhouette values (for example compared to the silhouette values of the non-stimulus part of the basic clustering). It will be interesting to compare this clustering with the other variations. The optimal number of clusters is 2 for the lower frequencies, but in the higher frequencies a higher number is more optimal.

[Figure: mean silhouette value as a function of the start of the band pass (Hz) and the number of clusters, for the non-stimulus part (left) and the stimulus part (right).]

Figure 3.15: The mean silhouette values compared between different frequencies and numbers of clusters. Lower frequencies have a better clustering structure.

For the AR models a similar approach is taken. The number of parameters in the AR model is plotted against the number of clusters in Figure 3.16. Two is, for all numbers of parameters, the best option for the number of clusters. The silhouette value is the highest with 5 and 4 parameters for the non-stimulus and the stimulus part, respectively. This is very low, and it is impossible that the time series can be rebuilt decently with so few parameters. It could however again be interesting to compare these clusterings with the other variations. Therefore 5 is chosen as the number of parameters.

[Figure: mean silhouette value as a function of the number of AR parameters and the number of clusters, for the non-stimulus part (left) and the stimulus part (right).]

Figure 3.16: The mean silhouette values compared between different numbers of AR parameters and numbers of clusters. The clustering structure is highest for 5 AR parameters for the non-stimulus part and for 4 parameters for the stimulus part.

3.3.5 Cost

In this section the computational cost of the k-means algorithm is discussed. The complexity of the algorithms will not be derived; only some comments about the time needed in practice will be given. This approach is deliberately very problem-specific: the objective is to research how the algorithms perform for this kind of data, not in general. The measurements were performed in MATLAB on an Intel Core i7-3630QM with a 2.4 GHz CPU and 8 GB of RAM.

In Table 3.6 the duration of the tuning is listed for every variation. This is the computation of the number of clusters and of an extra parameter if available. One could also compute the duration of one clustering with a specific number of clusters and a specific choice for a parameter, but this would give a biased result. It is always important to perform a good tuning, whatever the purpose of the clustering is; a single clustering without tuning would be rather useless.

Note however that the band-passed and AR variations have 2 parameters to tune and therefore take more time to tune. There are some other things that stand out in this table. First, it is noticeable that the algorithms that work with the Euclidean distance take much longer than the other variations. The reason for this is that it is much easier and faster to compute the correlations: when the cluster centers as well as the data are normalized, the correlation is reduced to a simple dot product (see equation 3.6).

Variation                                Non-stimulus   Stimulus
Basic correlation                        1236 s         804 s
Euclidean distance after normalization   29083 s        17389 s
Euclidean distance                       9277 s         6402 s
No-outliers                              73 s           115 s
Low-pass filtered                        954 s          603 s
Band-passed                              8760 s         8791 s
AR                                       8142 s         7363 s

Table 3.6: Computational cost of the tuning.

[Figure: three panels versus the number of clusters (2-20): the time of one replicate in seconds, the number of iterations in phase 1 and the number of iterations in phase 2, for the non-stimulus and stimulus parts.]

Figure 3.17: The time of one replicate and the number of iterations in each phase for that run, tested for the basic k-means algorithm with multiple numbers of clusters. Phase one is the part where cluster centers are computed as the mean of the neurons of that cluster, after which the neurons are again assigned to their nearest center (until convergence). Phase two consists of individual reassignments of the neurons.


Another interesting fact is that the stimulus part almost always seems to be faster. More details about this can be seen in Figure 3.17. It is clear that especially phase 2 (the individual reassignments of the neurons) needs more iterations in the non-stimulus part: it takes more time before the algorithm has fully converged to a local minimum.

3.3.6 Some results from the k-means algorithm

In this section the cluster centers of some of the variations of the algorithm are shown. These cluster centers are very important and central in the k-means algorithm. The AR model centers are not shown because these centers are not in the time domain, and reconstructing the signal does not give good results. For the visualization of the clusters, the number of clusters is chosen to be three. Two cluster centers would maybe be more appropriate following the results of the tuning section, but all the cluster centers of such an analysis are also visible when the algorithm looks for three clusters. A division into two clusters is just a division into an active and an inactive part; three clusters give more information and are thus more interesting. More clusters could be even more interesting, but then the internal validation measure that was in favor of two clusters would be neglected.

[Figure: six panels of cluster center amplitudes versus time steps for the non-stimulus part: basic, Euclidean distance, low-pass filtered, band-pass filtered, HCN and normalized Euclidean distance variations.]

Figure 3.18: The cluster centers of different variations after k-means clustering of the non-stimulus part. Similar cluster centers are in the same color. For all variations, the preprocessed data that is actually clustered is shown.


In Figure 3.18 the centers of most of the variations of the non-stimulus part are shown. The centers are similar in all the variations, except in the band-passed version where only higher frequencies are kept. Three clusters is maybe too much for this part, because in most of the variations one can only see one cluster of peaks and two others which do not differ too much from each other. For all the variations, the preprocessed data is used in the figure because this is the data that is clustered.

This is however not the case in the stimulus part, shown in Figure 3.19. All the cluster centers now have fundamentally different forms. The forms are the clearest in the HCN variation, where the clouding 'outliers' are deleted. One center consists of the peaks which clearly accompany the stimuli. A second center shows the kind of peaks that were also visible in the non-stimulus part. A third center also accompanies the stimuli, but with a delay and less aggressively. It is interesting to see that the Euclidean variation only seems to find two kinds of centers: the stimulus peaks and random noise. The drop from two to three clusters in the mean silhouette value was also the biggest for this variation, so this may have been expected (see Figure 3.10).

[Figure: six panels of cluster center amplitudes versus time steps for the stimulus part: basic, Euclidean distance, low-pass filtered, band-pass filtered, HCN and normalized Euclidean distance variations.]

Figure 3.19: The cluster centers of different variations after k-means clustering of the stimulus part. Similar cluster centers are in the same color. For all variations, the preprocessed data that is actually clustered is shown.


3.4 Hierarchical clustering

Besides partitional clustering, the second big group of algorithms is that of the hierarchical clustering algorithms [30].

3.4.1 Theory

In hierarchical clustering, a tree of clusters is created. One can do this either bottom-up (agglomerative) or top-down (divisive). In agglomerative clustering, all the objects are regarded as single clusters in the beginning. Then the two closest (highest similarity or lowest distance) clusters are merged into a bigger cluster. This repeats until all objects are merged together or until a certain stopping criterion is reached (for example, if the two most similar clusters are too dissimilar according to a certain threshold). The top-down version starts by merging all the objects into one cluster. After that, the following two steps are repeated.

1. Select the least coherent cluster

2. Divide this cluster into two clusters

Note that another clustering algorithm is needed for the second step. Therefore only agglomerative clustering will be used here. To quantify the distance between two clusters C_a and C_b, with cluster centers or cluster means \vec{s}_a and \vec{s}_b and numbers of objects N_a and N_b, there are four common options:

1. single link: $d(C_a, C_b) = \min\{\, d(\vec{x}, \vec{y}) \mid \vec{x} \in C_a,\; \vec{y} \in C_b \,\}$

2. complete link: $d(C_a, C_b) = \max\{\, d(\vec{x}, \vec{y}) \mid \vec{x} \in C_a,\; \vec{y} \in C_b \,\}$

3. average link: $d(C_a, C_b) = \mathrm{mean}\{\, d(\vec{x}, \vec{y}) \mid \vec{x} \in C_a,\; \vec{y} \in C_b \,\}$

4. Ward's method: $d(C_a, C_b) = \dfrac{\|\vec{s}_a - \vec{s}_b\|^2}{\frac{1}{N_a} + \frac{1}{N_b}}$

In the single link option, the distance between two clusters is given by the minimum distance between two elements (each of a different cluster). A problem often encountered with this method is 'chaining'. An object close to one object of a cluster, but far from the other objects in the cluster, is added to the cluster. After this, a new object near the just added object but even farther from the rest is included in the cluster. This goes on and on, forming a chain of objects that are near to their neighboring objects but far from the rest.

In the complete link method the opposite is done: the distance between two clusters is now measured as the largest distance between two objects of the two clusters. This method is not affected by chaining but is very sensitive to outliers.

The third method, average link, tries to avoid both chaining and sensitivity to outliers. It uses the average distance between the objects in the two clusters.

The last method is Ward's method of minimum variance. This method is almost exclusively used with the (squared) Euclidean distance. At every step it merges the two clusters that minimize the total within-cluster variance.


Because correlation will be used as similarity measure, and chaining and a big influence of outliers have to be avoided, average linkage will be used.

There are several well-known problems with hierarchical clustering. The first problem is that once two clusters are merged (or a division into clusters is made in the top-down version), this cannot be undone. The second is that these methods do not scale well. One always needs to store a distance matrix with distances between all the objects; these are O(N^2) entries, with N the number of objects to cluster. The time complexity is also at least O(N^2). BIRCH (balanced iterative reducing and clustering using hierarchies) is an adaptation of the hierarchical clustering algorithm to handle a bigger amount of data that cannot even fit in the working memory [74]. For this data this complexity is however not a problem because there are only 965 neurons. When more neurons are measured, BIRCH can be considered.

For validation there is a special measure for hierarchical clustering, in which the cophenetic matrix is compared to the distance matrix [64]. First the cophenetic matrix has to be formed. The entry of the matrix at point (i, j) is the distance between the clusters of objects \vec{x}_i and \vec{x}_j right before they are merged. Together with the distance matrix, the cophenetic correlation coefficient is then computed, using formula (3.15) from Hubert's normalized Γ. Even high values of this correlation must be treated with care: values as high as 0.9 do not necessarily mean a close agreement between the matrices [58].

3.4.2 Implementation

Also for this algorithm, correlation will be used as similarity measure. In this case the algorithm needs a distance matrix, so the correlations are again converted to distances by using the formula $1 - \mathrm{cor}(\vec{x}, \vec{y})$. The distance matrix and the way to compute distances between clusters (average linkage) are everything the algorithm needs to create a dendrogram. The next step would be to decide on the number of clusters. To create a number of clusters one just has to stop merging at a certain point, and the remaining clusters are the final clusters. However, when one looks closer at the results, one can see that this is not the optimal approach for this data. For example, when one asks for 5 clusters in the non-stimulus part, one gets a big cluster of 924 neurons and 4 smaller ones with fewer than 25 neurons. This is of course not desirable. It is the effect of one big cluster to which outliers get added in small groups. It is even better illustrated in Figure 3.20. In this figure the number of clusters is plotted against the number of neurons in the biggest cluster that such a clustering creates. This image shows clearly that asking for more clusters only means that some small number of neurons gets split from the big chunk of neurons. Such neurons that are split from the big chunk can maybe be labeled as outliers. Only when more than 300 clusters are asked for, the big cluster finally splits in two. This does not indicate a good clustering structure. Note that the actual process of this agglomerative clustering goes in the opposite direction: neurons are not split off, but clusters are merged together.
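A minimal MATLAB sketch of this procedure; the variable names are placeholders, while the average linkage, the 0.875 cut off of the non-stimulus part and the merging of small clusters into a trash cluster follow the description in this chapter.

    D   = pdist(X, 'correlation');       % 1 - correlation distances between all neurons
    Z   = linkage(D, 'average');         % agglomerative clustering, average linkage
    c   = cophenet(Z, D);                % cophenetic correlation coefficient
    idx = cluster(Z, 'cutoff', 0.875, 'criterion', 'distance');   % cut the dendrogram
    % merge clusters with fewer than 20 neurons into one trash cluster
    counts = accumarray(idx, 1);
    trash  = ismember(idx, find(counts < 20));
    idx(trash) = max(idx) + 1;
    dendrogram(Z);                       % visual inspection of the merges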


[Figure: "Bad clustering structure"; the number of neurons in the biggest cluster versus the number of clusters, for the non-stimulus part.]

Figure 3.20: The number of clusters versus the biggest cluster in that case, tested for the non-stimulus case. This shows that when more clusters are asked for, the same big cluster is returned combined with several smaller clusters. Only after 300 clusters are asked for, a second big cluster is created.

Because one wants to find real clusters with more members, another approach is taken. Clusters with fewer than 20 neurons are merged together into one big trash cluster. This number is quite arbitrary and can be changed if needed: if it were bigger, barely any real clusters would be kept; if it were smaller, one would keep uninteresting 'outlier' clusters. The number used here is the same as the number of good neighbors a HCN should have. Using this procedure one can measure the number of real clusters (clusters with more than 20 neurons) at each cut off. The cut off is the maximum distance between two clusters, the maximum dissimilarity, that is allowed to merge them. For the non-stimulus part this is shown in Figure 3.21 and for the stimulus part in Figure 3.23. With these plots we still have no idea what the size of these clusters is (except that it is bigger than 20). Therefore plots 3.22 and 3.24 are created. In these plots every neuron gets a color at every cut off. Dark blue means that the neuron is in the trash cluster; another color means that the neuron is clustered in a good cluster. One can see clusters growing and merging together with other clusters. With these two plots one can give a good evaluation and choose a suitable cut off.


When the cut off gets higher, new good clusters are formed and existing clusters attract new neurons, so the number of clusters grows. At the same time, however, good clusters are merged together and the number of clusters decreases. It is also important to interpret the cut off in one way or another. When the cut off is 0.9, the average correlation of two groups that can be merged is at minimum 0.1, which is quite low. A lower cut off means that the good clusters at that point are tighter. However, a too low cut off means that barely any neurons get clustered in a real (non-trash) cluster.

[Figure: the number of good clusters and (the number of non-trash neurons)/100 versus the cut off, for hierarchical clustering of the non-stimulus part.]

Figure 3.21: The number of good (more than 20 neurons) clusters versus the cut off, and the number of neurons clustered in such a good cluster versus the cut off. Hierarchical clustering was used here on the non-stimulus part.

For a clustering into K clusters, the cut off is chosen so that there are K clusters and these K clusters have a lifetime that is as long as possible. The lifetime is the difference between the cut off where the cluster gets created and the cut off where the cluster merges with an older cluster. The neurons that are not clustered in a real cluster at that cut off are assigned to the nearest real cluster (also measured with the average distance). The minimum size of a real cluster is by default taken as 20, but it is lowered if more clusters need to be created.

One can see in Figure 3.22 that there are at most two or three good clusters in the non-stimulus part. The other clusters are only formed starting at a high cut off; these clusters contain neurons that are barely similar to each other. In the stimulus part (Figure 3.24) one can also only determine a maximum of four clusters that have a long lifetime.

[Figure: "Clustering of the neurons, non-stimulus part"; every neuron is given a color per cut off value.]

Figure 3.22: The creation and merging of good clusters (more than 20 neurons) versus the cut off. The dark-blue color indicates that the neuron has not been clustered in a good cluster and is in the trash cluster. The other colors indicate good clusters. Hierarchical clustering was used here on the non-stimulus part.

[Figure: the number of good clusters and (the number of non-trash neurons)/100 versus the cut off, for hierarchical clustering of the stimulus part.]

Figure 3.23: The number of good (more than 20 neurons) clusters versus the cut off, and the number of neurons clustered in such a good cluster versus the cut off. Hierarchical clustering was used here on the stimulus part.


[Figure: "Clustering of the neurons, stimulus part"; every neuron is given a color per cut off value.]

Figure 3.24: The creation and merging of good clusters (more than 20 neurons) versus the cut off. The dark-blue color indicates that the neuron has not been clustered in a good cluster and is in the trash cluster. The other colors indicate good clusters. Hierarchical clustering was used here on the stimulus part.

3.4.3 Results

For the non-stimulus part the cut off is chosen to be 0.875. In that case there are clearly 2 quite big clusters. The resulting mean of the clustered neurons and their individual amplitudes can be seen in Figure 3.25.

For the stimulus part the cut off is chosen to be 0.825. In that case there exist three fairly big good clusters. The results are shown in Figure 3.26. One can clearly differentiate the two kinds of stimulus-related peaks and the other kind of peaks.

In these plots the neurons of the trash cluster are grouped together. Another option would be to assign them to the good cluster to which they are nearest.

Another advantage of hierarchical clustering is that one can make a dendrogram. This way one can clearly see which clusters merge together.


[Figure: for the trash cluster and for clusters 1 and 2 of the non-stimulus part, the individual responses of the clustered neurons and the mean amplitude of the cluster versus the time step.]

Figure 3.25: Non-stimulus part of hierarchical clustering: the mean values (cluster centers) of each cluster and the individual responses of the clustered neurons. The upper plots show the trash cluster with all the neurons that have not been clustered in a good (more than 20 neurons) cluster. The time series of the individual neurons and the cluster centers have been normalized.


[Figure: for the trash cluster and for clusters 1, 2 and 3 of the stimulus part, the individual responses of the clustered neurons and the mean amplitude of the cluster versus the time step.]

Figure 3.26: Stimulus part of hierarchical clustering: the mean values (cluster centers) of each cluster and the individual responses of the clustered neurons. The upper plots show the trash cluster with all the neurons that have not been clustered in a good (more than 20 neurons) cluster. The time series of the individual neurons and the cluster centers have been normalized.

3.4.4 Cost

The algorithm does not take a lot of time. The longest part is the search for a good cut off. The algorithm takes 94 seconds for the non-stimulus part and 91 seconds for the stimulus part. As regards the memory, one should note that the distance matrix with N^2 numbers has to be stored, with N the number of neurons.

3.5 Spectral clustering

Most of the literature concerning fMRI data clustering was written five to fifteen years ago. More recent clustering algorithms have not yet been tested on such data. These new algorithms should however be improvements over the older ones and should outperform the classic algorithms. Therefore one of these algorithms was chosen to add to this comparison. This algorithm is spectral clustering, which is inspired by graph theory. Spectral clustering has become increasingly popular in the last five years [44].

3.5.1 Theory

Spectral clustering works with the similarity matrix and its eigenvalue decomposition. This matrix S is given by

\[ S_{ij} = s(\vec{x}_i, \vec{x}_j) \qquad (3.25) \]

Also needed is the diagonal matrix D, with as elements the degree of each time series, given by

\[ D_{ii} = \sum_{j=1}^{N} S_{ij} \qquad (3.26) \]

After this the unnormalized Laplacian L is defined as

\[ L = D - S \qquad (3.27) \]

Then the eigenvalue decomposition of this matrix is computed. The eigenvectors corresponding to the K smallest eigenvalues are used as new data for the clustering problem. This clustering is mostly done with k-means. A more advanced and better solution is not to use the unnormalized Laplacian but a normalized version [63].

\[ L_{\mathrm{norm}} = I - D^{-1}S = D^{-1}L \qquad (3.28) \]

Clustering is about maximizing between-cluster dissimilarity and within-cluster similarity. Unnormalized spectral clustering has the disadvantage, compared to this normalized clustering, of only focusing on the first of these two objectives. Note however that there are also other ways to normalize the Laplacian; this normalized solution has the best consistency features according to Luxburg [44]. To make these formulas more intuitive, one can compare them with the theory behind graph random walks [43]. The probability to jump from one point i in a graph to another point j (a random walk) is given by $S_{ij}/D_{ii}$. Therefore, one can define the transition matrix of such random walks as I minus the normalized Laplacian $L_{\mathrm{norm}}$. It follows that the lowest eigenvalues and eigenvectors of this $L_{\mathrm{norm}}$ say something about the cluster properties of the graph [43].
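A compact MATLAB sketch of these steps, assuming a similarity matrix S has already been built (for example from nearest neighbors or an RBF kernel, as described in the next subsection) and K is the desired number of clusters; all names are illustrative.

    N     = size(S, 1);
    Dg    = diag(sum(S, 2));                  % degree matrix, equation (3.26)
    Lnorm = eye(N) - Dg \ S;                  % normalized Laplacian, equation (3.28)
    [V, E]    = eig(Lnorm);                   % eigenvalue decomposition
    [ev, ord] = sort(real(diag(E)));          % sort eigenvalues in ascending order
    U   = real(V(:, ord(1:K)));               % eigenvectors of the K smallest eigenvalues
    idx = kmeans(U, K, 'Replicates', 200);    % cluster the embedded neurons
    plot(ev(1:15), 'o');                      % inspect the eigenvalue gap to choose K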

Number of clusters

For the selection of the number of clusters, a special spectral clustering technique can be applied. In a perfect data case with K clusters, K eigenvalues would be zero or almost zero and the other eigenvalues would be considerably larger. Of course this is not expected for the noisy data from the neurons. Hopefully, however, there will be some small eigenvalues followed by a gap to the higher eigenvalues.


3.5.2 Variations

Also in the spectral clustering case four variations will be compared. All the variations only differ in the way the similarity matrix is formed. First there are the two $N_n$-nearest neighbors cases where correlation is used as similarity measure. One version is the mutual $N_n$-nearest neighbor version, where two neurons only have similarity 1 when they are both nearest neighbors of each other. The mutual nearest neighbor version does not (or not so much) connect points that come from different densities. The other is the normal nearest neighbors variation, where only one neuron needs to be a nearest neighbor of the other (and not necessarily the other way around) for the similarity to be set to 1. To compute the nearest neighbors, the Euclidean distance is used after the data is normalized. It would however make no difference if correlations were used.

Second, there are the two other variations that make use of the RBF kernel: once with normalized data (remember that correlation is very similar to the Euclidean distance if the data is normalized) and once with the raw data.
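As an illustration of these four ways to build the similarity matrix, a small sketch is given below. It is an assumption-laden example rather than the exact construction used here: the RBF kernel is taken in the common form $\exp(-\lVert x_i - x_j\rVert^2 / (2\sigma^2))$, and the nearest-neighbor graphs are built from Euclidean distances on whatever data (normalized or not) is passed in.

```python
import numpy as np

def rbf_similarity(X, sigma):
    """Full RBF similarity matrix, assuming S_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def knn_similarity(X, n_neighbors, mutual=False):
    """0/1 similarity from a (mutual) nearest-neighbor graph."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)                       # a point is not its own neighbor
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]      # indices of the nearest neighbors
    A = np.zeros_like(d2, dtype=bool)
    A[np.repeat(np.arange(len(X)), n_neighbors), idx.ravel()] = True
    S = (A & A.T) if mutual else (A | A.T)             # mutual: both ways, normal: at least one
    return S.astype(float)
```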

3.5.3 Tuning

Also in the case of spectral clustering, there are different parameters to be tuned. In the case of $N_n$-nearest neighbors, these are the number $N_n$ and the number of clusters K. For the normal nearest neighbor case Luxburg et al. propose log(number of neurons) as a good rule of thumb [44]. For the mutual nearest neighbor version they give no hints. However, the similarity matrix has to be created in such a way that there are fewer connected components than there are clusters to find. A connected component is a set of neurons that are nearest neighbors of each other but not of any neuron outside the connected component. For the normal nearest neighbor case this is not hard; already with $N_n = 2$ this requirement is satisfied. For the mutual case more neighbors are needed, as expected. The number of connected components versus the number of neighbors is shown in Figure 3.27. To choose a number, the silhouette value of the clustering of the eigenvectors is plotted against the number of nearest neighbors. The silhouette value should give an approximation of how good the clustering structure is. For the normal nearest neighbor, the mutual nearest neighbor, the RBF and the RBF-after-normalization version this is shown in Figures 3.28, 3.29, 3.30 and 3.31 respectively. Also the number of neurons in the smallest cluster is shown in these figures, because sometimes the data is split into an outlier and the rest, which is of course not a desirable result. As expected for the mutual nearest neighbor variation, when the number of neighbors is too low the smallest cluster is very small and consists of one connected component, which may even consist of a single outlier. Only the clusterings with 2 to 5 clusters are shown because by now it is expected that the optimal number of clusters lies in this region. More clusterings would make the figures cluttered.

For the normal nearest neighbor case in the non-stimulus part the silhouette values are clearly highest for 2 nearest neighbors for most numbers of clusters. This however goes together with a very small size of the smallest cluster in clusterings with 3 clusters or more. A higher number of neighbors barely makes this smallest cluster bigger until $N_n$ is chosen to be 145. Therefore 145, which still ensures quite good silhouette values, is chosen as the optimal value. For the stimulus part it is however possible to enlarge the smallest cluster more easily. In that case 15 is a more optimal value. In the non-stimulus part of the mutual neighbor variation, 100 nearest neighbors give a good compromise between the number of neurons in the smallest cluster and the silhouette value. For the stimulus part 120 is a good choice.

For the choice of σ the same tuning process is done. Also here a compromise between the goodness of the clustering structure and the number of neurons in the smallest cluster has to be made. For the RBF version with normalized data, this gives a result of 256 for both the stimulus and the non-stimulus part. Also in this case, no matter the value of σ, the smallest cluster stays small in the non-stimulus part. When the data is not normalized, this gets even worse. There the best value is 4096, but even then there are only 2 neurons in the smallest cluster when the number of clusters is chosen to be 4 in the non-stimulus part. Moreover, also in the stimulus part there are then only 4 neurons left in the smallest cluster. This σ should ensure that near neighbors have a high similarity and others a low similarity. Unfortunately this is not possible with this data. No matter what σ or number of nearest neighbors one takes, the difference between the mean RBF value of nearest neighbors and non-nearest neighbors is never very large. This is partly because the clustering structure is clouded by the LCN. Such large optimal values of the parameter σ also show that the problem is quite linear.

Figure 3.27: Number of connected components versus the number of mutual nearest neighbors. On the left side the non-stimulus part and on the right side the stimulus part.


Figure 3.28: Mean silhouette value and size of the smallest cluster versus the number of nearest neighbors. On the left side the non-stimulus part and on the right side the stimulus part. For the non-stimulus part 145 is chosen (otherwise the size of the smallest cluster is too small). For the stimulus part 15 is chosen.

Figure 3.29: Mean silhouette value and size of the smallest cluster versus the number of mutual nearest neighbors. On the left side the non-stimulus part and on the right side the stimulus part. The chosen values for the non-stimulus and the stimulus part are 100 and 120.


Figure 3.30: Mean silhouette value and size of the smallest cluster versus sigma. On the left side the non-stimulus part and on the right side the stimulus part. The data was normalized before the algorithm. The chosen σ is 256 for both the stimulus and the non-stimulus part.

Figure 3.31: Mean silhouette value and size of the smallest cluster versus sigma. On the left side the non-stimulus part and on the right side the stimulus part. The chosen σ is 4096 for both the stimulus and the non-stimulus part.


For the number of clusters, a good option for the spectral clustering algorithm is to look at the eigenvalues. In the optimal scenario, if the optimal number of clusters is K, then the K smallest eigenvalues are zero. In less optimal cases one still hopes to see a knee, a big increase at eigenvalue K+1. It is already clear from the previous algorithms that there is no clear optimal number of clusters. Also in the case of spectral clustering one cannot find a knee, so the optimal number of clusters remains undecided.

Cost

The first part of the algorithm can be implemented quite efficiently. The eigenvalue decomposition takes only 0.5 seconds to compute. Because the nearest neighbor variation creates sparse matrices, this duration can even be reduced to 0.25 seconds. Altogether, the algorithm takes 1.3 seconds to compute the eigenvectors from scratch for the nearest neighbor variation and 1.1 seconds for the RBF kernel variation (for this variation it is easier to compute the similarity matrix). The longest part of this algorithm is the k-means clustering of these eigenvectors, especially in the tuning case where the k-means algorithm has to run 19 times to estimate the goodness of the clustering for 19 different numbers of clusters. The tuning durations can be found in Table 3.7.

Variation                  Non-stimulus   Stimulus
Nearest neighbor           225 s          169 s
Mutual nearest neighbor    245 s          273 s
RBF scaled                 687 s          640 s
RBF                        749 s          623 s

Table 3.7: Time of the tuning process.

Note that the tuning of the RBF variations tested twice as many parameter values as the nearest neighbors variations. This means that the variations would not differ too much in duration if the same number of values were tested. The k-means algorithm together with the computation of the silhouette value is responsible for about 90% of the duration.

As in hierarchical clustering, one also needs to store a similarity matrix, which grows quadratically with the number of neurons. In the case of the (mutual) nearest neighbor variation this is not much of a problem, because the matrix is sparse.

Results

The cluster centers or means make little sense in the case of spectral clustering. When for example two concentric circles with different radii are clustered with spectral clustering, both clusters would have the same center, though the clusters are very different. It is however possible to plot the activation of each individual clustered neuron. The mean is still plotted for visual reasons.

For the mutual nearest neighbor case this can be seen in Figures 3.32 and 3.33 for 4 clusters, for the non-stimulus and the stimulus part respectively.

Figure 3.32: The neurons are clustered in 4 clusters using spectral clustering with mutual nearest neighbor as similarity method. The clustering is done on the non-stimulus part. On the left is the activation of the individual neurons. The more red the color, the higher the activation. On the right they are plotted together with the mean. The time series are normalized before plotting them.


Figure 3.33: The neurons are clustered in 4 clusters using spectral clustering with mutual nearest neighbor as similarity method. The clustering is done on the stimulus part. On the left is the activation of the individual neurons. The more red the color, the higher the activation. On the right they are plotted together with the mean. The time series are normalized before plotting them.


3.6 Fuzzy c-means clustering

All the clustering algorithms seen so far are hard clustering algorithms. This means that there is always an all-or-nothing decision: each object belongs to one and only one cluster. Another category is fuzzy clustering or soft clustering. Here an object can be in multiple clusters: for each cluster it has a membership value. Such an algorithm might be better suited to realistically describe the clustering situation of the zebra-fish. Every neuron has more than one function and can be part of multiple functional sections in the brain. The best known fuzzy clustering algorithm is the fuzzy c-means algorithm described by Bezdek [11]. As the name suggests, there is much resemblance with the k-means algorithm.

3.6.1 Theory

The objective function this algorithm tries to minimize is:

$$ J(C) = \sum_{k=1}^{K} \sum_{\vec{x}_i \in C_k} (u_{ki})^m \, d(\vec{x}_i, \vec{s}_k)^2 \qquad (3.29) $$

Here $u_{ki}$ is the degree of membership of the object $\vec{x}_i$ to cluster $C_k$. Also step two of the k-means algorithm is changed. The objects are no longer assigned to one cluster; instead their degree of membership is updated. This is done with the following equation:

$$ u_{ki} = \left[ \sum_{j=1}^{K} \left( \frac{d(\vec{s}_k, \vec{x}_i)}{d(\vec{s}_j, \vec{x}_i)} \right)^{2/(m-1)} \right]^{-1} \qquad (3.30) $$

with m the so-called fuzzifier. Finally the cluster centers or cluster means are computed as follows:

$$ \vec{s}_k = \frac{\sum_{i=1}^{N} (u_{ki})^m \, \vec{x}_i}{\sum_{i=1}^{N} (u_{ki})^m} \qquad (3.31) $$

The parameter m, also known as the fuzzifier, determines the level of cluster fuzziness. It is always greater than or equal to one. When it equals one, the method converges to a hard clustering again. When m becomes bigger the clusters get fuzzier. When there is no prior information, the parameter is mostly set to 2 [71]. However, Schwämmle et al. [61] object to this and propose the following formula to choose the fuzzifier:

$$ m = 1 + \left(\frac{1418}{N} + 22.05\right) T^{-2} + \left(\frac{12.33}{N} + 0.243\right) T^{-0.0406\,\ln(N) - 0.1134} \qquad (3.32) $$

with N the number of neurons and T the number of time steps. Another advantage of fuzzy c-means is that it is much more robust to noise [29]. Outliers or neurons that do not clearly belong to a cluster have lower membership values and do not influence the cluster centers that much.
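The update equations (3.30)–(3.32) translate almost directly into code. The sketch below is only an illustration under simplifying assumptions (Euclidean distances, random initialization of the memberships, a fixed number of iterations) and not the implementation used for the experiments in this thesis.

```python
import numpy as np

def fuzzifier(N, T):
    """Fuzzifier m proposed by Schwämmle et al., eq. (3.32), for N neurons and T time steps."""
    return (1.0
            + (1418.0 / N + 22.05) * T**-2.0
            + (12.33 / N + 0.243) * T**(-0.0406 * np.log(N) - 0.1134))

def fuzzy_cmeans(X, K, m, n_iter=100, eps=1e-9, seed=None):
    """Minimal fuzzy c-means with Euclidean distances, eqs. (3.29)-(3.31).
    X is an N x T data matrix; returns the K x T centers and the K x N memberships."""
    rng = np.random.default_rng(seed)
    U = rng.random((K, X.shape[0]))
    U /= U.sum(axis=0, keepdims=True)                       # memberships of each object sum to 1
    for _ in range(n_iter):
        Um = U**m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)  # eq. (3.31)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + eps
        U = d ** (-2.0 / (m - 1.0))
        U /= U.sum(axis=0, keepdims=True)                   # eq. (3.30)
    return centers, U
```

For data of the size used in this thesis, eq. (3.32) gives a fuzzifier close to 1, as discussed in the tuning section below.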


Validation measure

Also for soft clustering algorithms (e.g., fuzzy c-means) validity measures exist. Asimple one is the partition coefficient [10]:

$$ PC = \frac{1}{N} \sum_{i=1}^{K} \sum_{j=1}^{N} u_{ij}^2 \qquad (3.33) $$

where N is the number of neurons and K the number of clusters. This measure gets closer to unity when the clustering becomes harder. When most of the objects are divided over the groups (there is barely a clustering structure), this measure gets lower. When the measure approaches its minimum 1/K, it means that the algorithm failed completely or that there is no structure at all [53]. To find the optimum, one needs to look for a knee. Note however that this measure is quite influenced by the fuzzifier m. When m approaches one from above, the measure can no longer differentiate between different numbers of clusters. Additionally, when m approaches infinity the knee is always found around two clusters.

Another internal validation measure is the minimum centroid distance [61]

$$ MCD = \min_{i \neq j} d(\vec{s}_i, \vec{s}_j) \qquad (3.34) $$

Also here one has to look for a knee.
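Both validation measures are cheap to compute once the memberships and centers are available. The snippet below is a small sketch of how they could be evaluated, assuming the partition coefficient is normalized by the number of neurons N and that U and centers come from a fuzzy c-means run such as the sketch above.

```python
import numpy as np
from scipy.spatial.distance import pdist

def partition_coefficient(U):
    """Partition coefficient, eq. (3.33); U is the K x N membership matrix."""
    return np.sum(U**2) / U.shape[1]

def minimum_centroid_distance(centers):
    """Minimum pairwise distance between the cluster centers, eq. (3.34)."""
    return pdist(centers).min()
```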

3.6.2 Implementation

The fuzzy c-means algorithm was run in two variations: one with correlation as distance measure and one with Euclidean distances on the normalized data.

3.6.3 Tuning

For the zebra-fish data, equation 3.32 leads to m = 1.0202. Note however that Schwämmle et al. [61] derived their formula for Euclidean distances. It is not known whether the effects are the same if correlation is used. Therefore m = 1.0202 is taken as a standard, but other values around this m are also checked. To check the influence of a certain fuzzifier, the minimum centroid distance is measured for several numbers of clusters. If this distance is 0, at least two clusters are the same and the fuzzifier is too high. The results are shown in Figure 3.34. For the stimulus part of the correlation variation, the fuzzifier seems to be a bit too low. A higher fuzzifier would not have a drastic impact on the cluster centers, but would only reduce the influence of noise and outliers. Therefore a fuzzifier of 1.09 is chosen there. For all the other variations the fuzzifier proposed by equation 3.32 is kept. A fuzzifier of 2 would make almost all the minimum centroid distances 0.

The validation measures are plotted in Figure 3.35. For the Euclidean distance variation, both validation measures give the same result. For the stimulus part there is a knee at 3 clusters. For the non-stimulus part there seem to be two knees, one at 3 and one at 5 clusters. For the correlation variation, the centroid distance indicates 2 or 4 clusters for the stimulus part and 2 clusters for the non-stimulus part. The partition coefficient however shows strange effects for this variation; it does not seem to give the same kind of results as with the Euclidean distance, especially for a higher number of clusters.

Figure 3.34: The minimum centroid distance for several fuzzifiers.

Figure 3.35: Internal validation measures for fuzzy c-means clustering. On the left side the centroid distance and on the right side the partition coefficient.


3.6.4 Cost

The duration of the tuning was 989 seconds for the non-stimulus part and 492 seconds for the stimulus part when the Euclidean distance was used on the normalized data. The correlation variation was, as expected, again a lot faster: 16 seconds for the non-stimulus part and 10 seconds for the stimulus part.

3.6.5 Results

It is interesting to look at the membership functions of each neuron. In Figure 3.36 the boxplots of the maximum membership function of the HCN and the LCN are shown. Clearly these membership functions are lower for the LCN. The LCN are not as clearly clustered as the HCN. This also means that the LCN have less effect on the clustering and the centers than the HCN. In a way, this algorithm automatically splits the neurons into HCN and LCN, good neurons and outliers. These boxplots are computed for the non-stimulus part with 5 clusters.

Not only the time series of all the neurons are available, but also the brain map can be used. The locations of all the neurons in the brain are known. This information will be used more in the next sections and chapters. For the fuzzy c-means algorithm this gives the opportunity for an interesting plot. When the number of clusters is chosen as 3, every cluster gets one specific color (red, green or blue) and for each neuron the intensity of each color is its membership value for that cluster. The result is seen in Figure 3.37 for the non-stimulus part and in Figure 3.38 for the stimulus part.
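Such a plot is straightforward to produce once the membership matrix and the neuron coordinates are available. The sketch below is only an illustration; the membership matrix U (3 x N), the coordinate array coords (N x 2) and the plotting details are assumptions, not the exact code used to produce Figures 3.37 and 3.38.

```python
import numpy as np
import matplotlib.pyplot as plt

def membership_brain_map(U, coords):
    """Color each neuron by its three membership values interpreted as (R, G, B)."""
    rgb = np.clip(U.T, 0.0, 1.0)            # row i holds the memberships of neuron i
    plt.scatter(coords[:, 0], coords[:, 1], c=rgb, s=8)
    plt.gca().invert_yaxis()                # image-style orientation, as in the brain maps
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()
```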

Figure 3.36: Boxplots of the membership functions of the HCN and the LCN for the non-stimulus part with 5 clusters. The variation with Euclidean distances on the normalized data is used.


Figure 3.37: The neurons are clustered into 3 clusters using fuzzy c-means clustering. The intensity of the colors in the brain map depends on the membership functions of each neuron. The non-stimulus part is clustered.

Figure 3.38: The neurons are clustered into 3 clusters using fuzzy c-means clustering. The intensity of the colors in the brain map depends on the membership functions of each neuron. The stimulus part is clustered.


3.7 Neural gas algorithm

This is a lesser-known algorithm, but it performed quite well in previous comparative studies, as shown in Chapter 2. Unfortunately there is very little literature about the algorithm or about specific validation techniques for it.

3.7.1 Theory

The algorithm was introduced by Martinetz et al. [45]. It is inspired by the self-organizing map algorithm. It has the following steps:

1. First initialize a set of K clusters, $c_1, \ldots, c_K$. Every cluster has a center vector $\vec{s}_k$.

2. Draw at random a time series $\vec{x}_i$.

3. Then the distance to the center vectors is computed. All the center vectors get an index: $i_0$ for the closest, $i_1$ for the second closest, and so on.

4. The center vectors are adapted: $\vec{s}_{i_k}^{\,t+1} = \vec{s}_{i_k}^{\,t} + \varepsilon\, e^{-k/\lambda}\, (\vec{x}_i - \vec{s}_{i_k}^{\,t})$

5. Increase t. If t has not yet reached $t_{\max}$, continue with step 2.

The ε and λ change in every iteration according to

$$ \lambda_t = \lambda_{\mathrm{initial}} \left( \frac{\lambda_{\mathrm{end}}}{\lambda_{\mathrm{initial}}} \right)^{t/t_{\max}} \qquad (3.35) $$

and

$$ \varepsilon_t = \varepsilon_{\mathrm{initial}} \left( \frac{\varepsilon_{\mathrm{end}}}{\varepsilon_{\mathrm{initial}}} \right)^{t/t_{\max}} \qquad (3.36) $$

The algorithm is actually very similar to the k-means algorithm, but may convergefaster.
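The steps above can be written down compactly. The following is a minimal sketch of the update loop, not the SOM-toolbox implementation used in this thesis; in particular the end values of ε and λ and the random initialization of the centers are assumptions made for illustration.

```python
import numpy as np

def neural_gas(X, K, t_max, eps_init=0.5, eps_end=0.005, lam_init=None, lam_end=0.01, seed=None):
    """Sketch of the neural gas algorithm (steps 1-5, eqs. 3.35-3.36). X is an N x T data matrix."""
    rng = np.random.default_rng(seed)
    lam_init = K / 2.0 if lam_init is None else lam_init
    centers = X[rng.choice(len(X), K, replace=False)].copy()       # step 1
    for t in range(t_max):
        lam_t = lam_init * (lam_end / lam_init) ** (t / t_max)     # eq. (3.35)
        eps_t = eps_init * (eps_end / eps_init) ** (t / t_max)     # eq. (3.36)
        x = X[rng.integers(len(X))]                                # step 2
        order = np.argsort(np.linalg.norm(centers - x, axis=1))    # step 3: rank the centers
        ranks = np.empty(K)
        ranks[order] = np.arange(K)
        centers += eps_t * np.exp(-ranks / lam_t)[:, None] * (x - centers)  # step 4
    return centers                                                 # step 5: loop until t_max
```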

3.7.2 Implementation

For the implementation, MATLAB including the SOM toolbox is used [2]. The influence of the choice of $\varepsilon_{\mathrm{initial}}$ and $\lambda_{\mathrm{initial}}$ is small; they are usually taken as 0.5 and K/2 respectively. The algorithm works with the normalized data. Using the mean silhouette value, the best number of clusters is 2 for the non-stimulus part as well as for the stimulus part. This can be seen in Figure 3.39.


Figure 3.39: The silhouette values for the neural gas algorithm. Two is the optimal number of clusters for the non-stimulus as well as for the stimulus part.

Figure 3.40: The cluster centers for a neural gas clustering with five clusters. On the left side is the clustering for the non-stimulus part and on the right side is the clustering for the stimulus part.

3.7.3 Cost

The computation time for the non-stimulus as well as for the stimulus part was 263 s for all the cluster numbers together.


3.7.4 Results

The results are, as expected, very similar to the k-means version with normalized data. For the sake of variation, the cluster centers for a clustering with 5 clusters are shown in Figure 3.40. For 3 clusters the clustering would be almost identical to the clusterings in Figures 3.18 and 3.19 for the non-stimulus and the stimulus part respectively.

3.8 Independent Component Analysis

In the previous sections different clustering algorithms were discussed. In this section an algorithm will be introduced that was not originally designed as a clustering technique. It has however been used to analyze fMRI data and is therefore interesting for comparison purposes. This algorithm is Independent Component Analysis (ICA).

3.8.1 Theory

A well known problem solved by this method is the cocktail party problem. When different conversations are recorded by different microphones, this method is able to separate the different conversations out of the mixtures recorded by the microphones [35]. For an ICA, the mixtures x are given and one tries to find the independent components or sources s. It is assumed that the mixtures are created with the formula

$$ x = As \qquad (3.37) $$

with unknown mixing matrix A. The energy of the different sources cannot be estimated; therefore the magnitudes of the independent components are fixed to unit variance. In this case also the mean is assumed to be zero. The sign of the independent components will however still be unknown. The algorithm works under the assumption that the independent components are independent and have a non-Gaussian distribution. There exist different methods to perform this ICA. Miljkovic et al. [50] compared several such methods with respect to their performance on electroencephalography data. Because this is also data about the functioning of the brain, the same results may hold for the zebra-fish data. The best performing algorithm was the SOBI algorithm. This algorithm uses the correlations between the time series to perform ICA [8]. This is particularly interesting for the temporal ICA (TICA). It will however also be used for the spatial ICA (SICA); there not the correlations between the time series but the correlations between the spatial maps are used.

3.8.2 Implementation

Both temporal and spatial ICA are implemented. In the temporal version the outputs are the sources that are mixed to produce the individual time series of all the neurons. The problem is however that the sign of these sources is not known. To choose the sign of the sources, the correlation of the sources with the neurons is computed. Only the high correlations (|r| > 0.4) are kept and counted. If there are more negative than positive correlations, the sign of the source is flipped. For some sources there are no such correlations; then the threshold for a high correlation is lowered. To finally cluster the neurons, every neuron is grouped with the source with which it has the highest correlation.
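A small sketch of this sign correction is given below. It is an illustration under assumptions (the exact rule for lowering the threshold when no correlation exceeds 0.4 is not specified here, so the sketch simply halves it), not the implementation used in this thesis.

```python
import numpy as np

def fix_source_signs(sources, data, thresh=0.4):
    """Flip each temporal source when its strong correlations with the neuron traces
    are predominantly negative. sources: n_sources x T, data: n_neurons x T."""
    sources = sources.copy()
    for k in range(sources.shape[0]):
        r = np.array([np.corrcoef(sources[k], x)[0, 1] for x in data])
        t = thresh
        while t > 1e-6 and not (np.abs(r) > t).any():   # no high correlation: lower the threshold
            t /= 2.0
        strong = r[np.abs(r) > t]
        if (strong < 0).sum() > (strong > 0).sum():
            sources[k] *= -1.0
    return sources
```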

In the spatial version the outputs are groups of neurons that work together (spatial components or maps), combined with the time series they produce. Using these time series one can again estimate the real signs of the time series and of the spatial components that produce them. The neurons are clustered into the group or spatial component where they have the highest score (after normalization of the scores).

3.8.3 Cost

The costs of the algorithm are shown in Table 3.8.

Variation   Non-stimulus   Stimulus
SICA        254 s          147 s
TICA        110 s          84 s

Table 3.8: Duration of the independent component analyses for finding 2 to 20 sources.

Figure 3.41: 4 sources discovered with temporal ICA from the non-stimulus (left) and stimulus (right) part.


3.8.4 Results

For the temporal ICA the 4 most important sources are shown in Figure 3.41 for the stimulus and the non-stimulus part. For the third independent component of the non-stimulus part, one can ask whether its sign is correct. Some peaks seem to be in the wrong direction.

The results from the spatial ICA are shown in Figure 3.42. To find the activated brain regions for each spatial component, the results are first normalized and then the neurons with a score higher than 1.5 are labeled activated. The means of the corresponding normalized time series are shown on the right side of the figure.

Figure 3.42: Spatial ICA result from the stimulus part: 4 most important spatial maps on the left (activation is shown in red) and their centers on the right.


3.9 Spatial Coefficient

When one looks at the spatial components in Figure 3.42 or the brain maps from the fuzzy c-means clustering in Figures 3.37 and 3.38, one can see that neurons that have similar time series and are grouped in the same cluster are also located spatially close to each other. It has long been known that neurons with similar functions are located near each other in the brain [24]. This can be used as an external validation measure. It is expected that good clusterings lead to spatially connected components. Therefore a spatial coefficient (SC) is constructed to measure such spatial connections. If even more were known about the neurons, for example how the neurons are connected in the brain of the zebra fish with their axons and dendrites, such an SC could be even more accurate.

A big advantage of the single-neuron imaging that is used to create the data from the zebra fish is its very high spatial accuracy: the location of every neuron is known. With this information an SC can be created.

3.9.1 Creation of the Spatial Coefficient

A first possibility would be to use clustering-related measures for such an SC. One could for example take the coordinates of the neurons as data and the result of the time series clustering as labels. Then one could compute the silhouette value or another internal validation measure on this 'clustering'. However, as can be seen in for example Figure 3.42, the neurons are spatially connected but not grouped in one spatial cluster; they are split up into several groups of spatially connected neurons. One can for example expect that the brain works in a somewhat symmetric way. There will be a part on the left side that has the same function as a part on the right side (for basic functions). The time series will ensure that both such symmetric parts are clustered in the same cluster. This creates two groups of neurons. The groups are not connected to each other, but within each group the neurons are spatially connected. This is still a good solution, but a clustering measure would fail on such data.

Another possibility would be the use of the aggregation index [42]; however, the data is not structured in a way suited to this index.

Therefore a new coefficient, suited to this zebra-fish data, is created. Every neuron gets a score: the fraction of its $N_n$ spatially nearest neighbors that are in the same cluster. These individual scores do not carry that much information on their own. The spatial coefficient of a cluster is given by the mean of all these ratios within that cluster. The SC of the whole clustering is then the mean of all the ratios of all the clusters, or the weighted mean of the SCs of the individual clusters. $N_n$ is chosen to be 5 in this case. This number has to be quite small because spatially connected groups of neurons in the brain can be quite thin. If this number of nearest neighbors were higher, such thin clusters would be neglected. On the other hand, if this number were lower, also too thin clusters would get a high SC. This number is however still quite arbitrary in the sense of what a thick and a thin spatially connected group of neurons is. To demonstrate the values, the mean ratio of a too thin, a just thick enough and a thick cluster is computed for different numbers of nearest neighbors.


The brain maps are shown in Figure 3.43 and the mean ratios of the yellow group are shown in the following table:

Number of nearest neighbors   Too thin   Just thick enough   Thick
2                             0.6023     0.8304              0.9385
5                             0.4591     0.7791              0.9135
10                            0.3523     0.7226              0.8837

Table 3.9: Number of nearest neighbors versus thickness of the cluster.

Figure 3.43: A too thin (yellow) group of spatially connected neurons, a just thick enough one and a big one.

3.9.2 Normalization of the Spatial Coefficient

Every cluster C gets a spatial coefficient $SC_C$. Without normalization this $SC_C$ is however very dependent on the number of neurons in that cluster. For a very large cluster, it is almost impossible to get low ratios. Therefore the expected value is subtracted from the $SC_C$ of every cluster C. This expected value can be computed rather easily. With N the total number of neurons, L the number of neurons in the cluster C and $N_n$ the number of nearest neighbors considered, the expected value for cluster C is given by

$$ E_C(N_n, N, L) = \sum_{m=0}^{N_n} \frac{m}{N_n}\, P(N_n, N, L, m) \qquad (3.38) $$

with the probability $P(N_n, N, L, m)$ given by

$$ P_C(N_n, N, L, m) = \binom{N_n}{m} \frac{\prod_{i=1}^{m}(L-i)\, \prod_{j=1}^{N_n-m}(N-L-j+1)}{\prod_{l=1}^{N_n}(N-l)} \qquad (3.39) $$


However, a perfect or almost perfect clustering should still get a high SC. When a cluster has a lot of neurons, the subtracted expected value is already very high. To prevent a low score in such a case, the final normalized $SC_C$ of a cluster C with L neurons is given by

$$ SC_{\mathrm{norm},C} = \frac{SC_C - E_C}{\max(L) - E_C} \qquad (3.40) $$

This max(L) is of course always equal to 1. However, not every spatial neuron distribution makes this maximum attainable. Take for example a cluster that consists of 6 neurons. To achieve the maximum of 1 would require that every neuron in that cluster has all the other 5 neurons among its nearest neighbors. This is only possible for a very specific spatial distribution, where the 6 neurons are spatially well separated from the other neurons.

Another possibility would be to measure the maximum SC over different random distributions with the same overall density. Taking the mean of those maxima to compute max(L) also gives good results, but sometimes leads to normalized spatial coefficients higher than 1, which is of course not wanted.
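The whole construction fits in a few lines of code. The sketch below is an illustration, not the exact implementation: it uses a k-d tree for the nearest neighbors, takes max(L) = 1 as in equation 3.40, and uses the fact that the hypergeometric expectation of equations 3.38 and 3.39 reduces to the closed form $E_C = (L-1)/(N-1)$.

```python
import numpy as np
from scipy.spatial import cKDTree

def normalized_spatial_coefficient(coords, labels, n_neighbors=5):
    """Normalized spatial coefficient per cluster (eqs. 3.38-3.40), assuming max(L) = 1."""
    labels = np.asarray(labels)
    N = len(labels)
    tree = cKDTree(coords)
    _, nn = tree.query(coords, k=n_neighbors + 1)      # first neighbor is the point itself
    same = labels[nn[:, 1:]] == labels[:, None]
    ratios = same.mean(axis=1)                          # per-neuron score
    sc_norm = {}
    for c in np.unique(labels):
        in_c = labels == c
        L = in_c.sum()
        sc_c = ratios[in_c].mean()
        e_c = (L - 1) / (N - 1)                         # expected ratio for a random cluster of size L
        sc_norm[c] = (sc_c - e_c) / (1.0 - e_c)
    return sc_norm
```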

Several tests are executed to test this $SC_{\mathrm{norm}}$. A first test assesses whether random data indeed gives a low $SC_{\mathrm{norm}}$. To get a good average, 100 runs are computed. In every run, 3 clusters are created 965 times with varying sizes: cluster 1 has size 1 to 965, cluster 2 gets 2/3 of the rest and cluster 3 gets 1/3 of the rest. All the neurons are however divided at random. The result can be seen in Figure 3.44.

In the second test, also a good cluster is tested. All the neurons spatially located within a certain radius from the lower left corner belong to cluster 1. The rest of the neurons are randomly divided into 2 other clusters. This radius grows with each iteration, so that more and more neurons are clustered in cluster 1. This test is run 100 times. In every run all the neuron locations are randomly chosen. Two such realizations are shown in Figure 3.45. This is important because the exact locations of the neurons can have quite a bit of influence on the SC. The mean results are plotted in Figure 3.46. In the plot without normalization one can see that cluster 1 always scores quite well, but that also clusters 2 and 3 score above 0. This is solved by subtracting the expected value. Now clusters 2 and 3 have score 0 when all the neurons are combined in those two clusters (the moment when there are 0 neurons in cluster 1). When cluster 1 grows, the score of clusters 2 and 3 also increases. Although the neurons are randomly divided between these two clusters, they are not totally random: they can never be in the area of cluster 1. This is the reason why their score should correctly increase and be above zero. The score of cluster 1 is however not so good when only the expected value is subtracted. When cluster 1 grows, the expected value increases too much, so that the score of cluster 1 decreases. For this reason all the scores are divided by the theoretical maximum. This solves the problem: cluster 1 almost always has a perfect score. Only when cluster 1 is very big does the normalized SC decrease again; the expected value has grown too high and the theoretical maximum can only be reached with a very specific allocation of the neurons. Also when the cluster is very small its score is low: the cluster is too 'thin' to be important or to get a good score.

Figure 3.44: The mean SC value of random clusters. The scores are plotted against the size of cluster 1. On the left side the normalized SC is used and on the right side the unnormalized SC. The normalized SC values stay approximately 0.

Figure 3.45: Two realizations of the neuron locations from two runs. In every run the iteration where 200 neurons are grouped in cluster 1 is shown.


Figure 3.46: The mean value of the different scores of clusters like those in Figure 3.45 over different neuron distributions. The scores are plotted against the number of neurons in cluster 1. The normalized SC correctly scores the random clusters and the good cluster.


Chapter 4

Results

In this chapter the clusterings will be validated and compared. First the normalized SC will be computed for every clustering, and this score will be complemented with the size of each cluster. After this all the clusterings will be compared to each other. It is however impractical to create figures in which all the algorithms are compared while writing out their full names. Therefore all the algorithm names are abbreviated as in Table 4.1.

Abbreviation   Description
KMC            K-means with correlation
KME            K-means with Euclidean distances
KMFBP          K-means band-pass filtered (with correlation)
KMFLP          K-means low-pass filtered (with correlation)
AR             K-means on the auto-regression parameters
KMHCN          K-means on the high correlation neurons (with correlation)
KMN            K-means on the normalized data (with Euclidean distances)
TICA           Temporal independent component analysis
SICA           Spatial independent component analysis
FUZN           Fuzzy c-means on the normalized data (with Euclidean distances)
FUZC           Fuzzy c-means with correlation
SRBFN          Spectral clustering with RBF similarity on the normalized data
SRBF           Spectral clustering with RBF similarity on the unnormalized data
SNN            Spectral clustering with nearest neighbors
SMNN           Spectral clustering with mutual nearest neighbors
HIER           Hierarchical clustering
NG             Neural gas algorithm

Table 4.1: Abbreviations of the algorithms.


4.1 The external validation

To validate the clusterings, various methods are possible. In Chapter 3 different internal validation measures were tested. Those are however quite algorithm specific and do not always give a clear answer. Another option would be to ask for an expert's opinion, but such an assessment of all the clusterings is very cost-ineffective. A third option is an external validation measure. In this case the normalized SC presented in Section 3.9 will be used.

In this section the outputs of the algorithms will be tested with the normalized SC. However, some clusterings may cheat by creating very small clusters. Dividing the neurons into two groups is quite obvious: this would be the division of the neurons into an active and an inactive part. Finding more clusters can however be more difficult. Therefore some algorithms basically just keep the two main clusters and add some 'outliers' to a third cluster. With this technique they manage to keep a high normalized SC (almost the same as the normalized SC with 2 clusters), but this is not a good clustering for three clusters. Therefore clusterings that create clusters with fewer than 50 objects are considered bad clusterings and are given a normalized SC of 0. This number is chosen because a small deviation from it would not make a difference. When the size of the smallest cluster is smaller than 50, it is usually much smaller (less than 25). On the other hand, there are a lot of algorithms that produce a smallest cluster with a size between 50 and 80 neurons. This border is chosen to be a hard border. Another option would be to gradually penalize small clusters. However, the clusterings with a smallest cluster size between 50 and 80 are not necessarily bad clusterings.

In Figure 4.1 the normalized SC of these clusterings is shown, combined with the size of the smallest cluster. Only the results for 2 to 5 clusters are shown because all the validation measures implied such an optimal number of clusters. Obviously the minimum cluster size of each clustering decreases when the number of clusters increases. But also the normalized SC mostly decreases with a higher number of clusters. It is much easier to divide the data into two than into more proper clusters. The division into two clusters almost always gives a higher normalized SC, while there is less difference between the other numbers of clusters.

It is however very interesting to see that this is certainly not the case for the FUZN algorithm. There, the best normalized SC in the stimulus part is found for 3 clusters. Remember that the internal validation measure of the FUZN algorithm gave one of the clearest answers to the question of the optimal number of clusters, and that this was indeed 3 for the stimulus part (see Figure 3.35).

It also seems to be easier to find more proper clusters in the stimulus part than in the non-stimulus part. For example, the TICA algorithm never finds more than 2 good clusters in the non-stimulus part but finds as many as 5 good clusters in the stimulus part. The size of the smallest cluster is mostly lower in the non-stimulus part. This does however not mean that the normalized SC of the non-stimulus part is worse than that of the stimulus part. There are a lot of algorithms for which the non-stimulus part yields a higher normalized SC.

There is no algorithm that performs drastically better than the other algorithms; there are however algorithms that perform significantly worse than others. The KMFBP algorithm does not perform so well in the non-stimulus part. It is however remarkable that it does perform well in the stimulus part. The higher frequencies might be more important in that part. Also the SRBF and KME algorithms do not perform very well. They mostly do not find good clusters for a higher number of clusters. They have in common that they use the raw data: the data is not normalized and they do not use correlation (which also implicitly normalizes the data).

In Figure 4.2 all the best clusterings are shown. SNN is the only algorithm that is more than once the best choice. The division into 2 clusters is very different for the stimulus and the non-stimulus part. With 3 clusters, however, the stimulus and the non-stimulus part are more alike. To show the effect of the normalized SC, a bad clustering is also plotted in Figure 4.3. The blue and the red cluster are totally mixed, which is the reason for the low normalized SC.

To rank the algorithms a ranked voting system will be used (like in the Eurovision Song Contest). The algorithms are ranked for each number of clusters (2 to 5) and for each part by their normalized SC. The best algorithm gets 19 points, the second 17, and the third to the last get 15 down to 1 points. If no real clustering is produced (fewer than 50 neurons in the smallest cluster), 0 points are given. Taking just the mean of the normalized SC is not a good idea: the deviation between the SCs is not the same for each number of clusters or data part (stimulus or non-stimulus). When an algorithm scores extremely high one time and low all the other times, it would still get a good result in such a case. Then again, taking the median would give too little influence to a very good or very bad SC. Therefore a ranked voting system is a better option. The results are shown in Table 4.2.
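A small sketch of this scoring scheme is given below; the data structure holding the normalized SCs per setting is an assumption made for illustration.

```python
import numpy as np

def ranked_points(n_algorithms):
    """19 points for the best, 17 for the second, then 15 down to 1 for the rest."""
    return np.array([19, 17] + list(range(15, 0, -1)))[:n_algorithms]

def total_scores(sc_per_setting):
    """sc_per_setting: list of arrays, one per (part, number of clusters) combination,
    holding the normalized SC of every algorithm (0 when the smallest cluster has < 50 neurons)."""
    n_alg = len(sc_per_setting[0])
    points = ranked_points(n_alg)
    total = np.zeros(n_alg)
    for scores in sc_per_setting:
        order = np.argsort(-np.asarray(scores))            # best algorithm first
        for rank, alg in enumerate(order):
            if scores[alg] > 0:                            # 0 points for failed clusterings
                total[alg] += points[rank]
    return total
```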

The contemporary algorithm, spectral clustering with nearest neighbors, seems to be the best in this ranking. It is however followed very closely by the good old k-means (with correlation) algorithm and the hierarchical clustering algorithm. The SNN algorithm is clearly the best in the stimulus part but actually does not perform that well in the non-stimulus part; there the KMC algorithm is the best option. The hierarchical clustering algorithm performs quite well in both the stimulus and the non-stimulus part. The algorithms on the non-normalized data are clearly the worst, but also the ICA algorithms perform worse than most clustering algorithms. These results do however not say everything. It is expected that clustered neurons lie close to each other in the zebra-fish brain, but there is still no certainty about this. This ranking probably gives a good indication of good and bad, but cannot be seen as an absolute ranking.


Ranking   Algorithm   Total score   Non-stimulus part   Stimulus part
1         SNN         103           35                  68
2         KMC         98            60                  38
3         HIER        97            46                  51
4         KMFLP       92            55                  37
5         SMNN        90            40                  50
6         FUZC        84            43                  41
7         FUZN        81            40                  41
8         KMN         80            51                  29
9         SRBFN       78            44                  34
10        NG          78            38                  40
11        KMFBP       61            18                  43
12        AR          60            34                  26
13        KMHCN       59            29                  30
14        SICA        48            29                  19
15        TICA        44            13                  31
16        KME         19            2                   17
17        SRBF        19            3                   16

Table 4.2: Scores of the algorithms.


Figure 4.1: The normalized SC scores of all the different algorithms and variations for the stimulus and the non-stimulus part. Also the size of the smallest cluster is shown.
77

Page 89: Clustering neural data - KU Leuvenmaapc/master_theses... · 2013-08-26 · Clustering neural data Merijn Mestdagh Thesis voorgedragen tot het behalen van de graad van Master of Science

4. Results

[Panels: Non-stimulus, 2 clusters: KMN; Stimulus, 2 clusters: SNN; Non-stimulus, 3 clusters: KMHCN; Stimulus, 3 clusters: FUZN; Non-stimulus, 4 clusters: HIER; Stimulus, 4 clusters: SMNN; Non-stimulus, 5 clusters: KMFLP; Stimulus, 5 clusters: SNN.]

Figure 4.2: The algorithms with the best normalized SC for the stimulus and non-stimulus part and for number of clusters 2 to 5.


Figure 4.3: The brain map of the AR variation for the non-stimulus part for 4 clusters.

4.2 Comparison of the algorithms

In this section the algorithms will be compared to each other with the adjusted Rand index (see equation 3.20). This comparison is thus computed on the basis of how the neurons are clustered by the algorithms. Another possibility would be to compare the cluster centers. A cluster center does however not have a meaning for every algorithm (e.g., in spectral clustering) or cannot even be constructed (e.g., in AR).
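The pairwise comparison can be computed directly from the label vectors of the algorithms. The sketch below is an illustration; the dictionary that maps each algorithm to its list of label vectors (one per data part and number of clusters) is an assumed data structure, and scikit-learn's adjusted_rand_score is used for the ARI itself.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def mean_ari_matrix(labelings):
    """labelings: dict mapping an algorithm name to a list of label vectors,
    one per (part, number of clusters) setting. Returns names and the mean-ARI matrix."""
    names = list(labelings)
    A = np.eye(len(names))
    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            b = names[j]
            aris = [adjusted_rand_score(la, lb) for la, lb in zip(labelings[a], labelings[b])]
            A[i, j] = A[j, i] = np.mean(aris)
    return names, A
```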

Relatively speaking, the ARI values do not differ much between the numbers of clusters and between the stimulus and non-stimulus part. If two algorithms are similar for 2 clusters, they are also quite similar for more clusters. The evolution of the mean ARI between all the algorithms can be seen in Figure 4.4. The overall trend is that the mean ARI decreases when the number of clusters increases. This is logical: when there is no real cluster structure left, when the algorithms look for more than five clusters, the algorithms cluster almost random data, and such random data can be clustered in many ways. It is however interesting to note that for the stimulus part the highest mean ARI is found for 4 clusters. There seems to be quite some agreement between the clustering algorithms for this number of clusters.

All the ARIs from both parts and for two to five clusters can be averaged into a similarity matrix that represents all the clusterings.


After all this clustering work, the natural next step is to create a new clustering from this similarity matrix. Hierarchical clustering with average linkage is used, so that the underlying structure of the algorithms can also be explored. This clustering has several advantages. After clustering the algorithms, the similarity matrix can be restructured so that it can be shown much more clearly; this is shown in Figure 4.5. The second, more obvious reason is that it is interesting to see which algorithms belong together. The dendrogram of this clustering is shown in Figure 4.6.
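A minimal MATLAB sketch of this step is given below. The names meanARI (the symmetric matrix of mean ARIs between the algorithms, with ones on the diagonal) and algNames (a cell array of algorithm abbreviations) are placeholders, not variables from this thesis.

D = 1 - meanARI;                        % turn the similarity into a dissimilarity
D(1:size(D,1)+1:end) = 0;               % enforce an exactly zero diagonal for squareform
Z = linkage(squareform(D), 'average');  % agglomerative clustering with average linkage
dendrogram(Z, 'Labels', algNames);      % the tree of Figure 4.6
groups = cluster(Z, 'maxclust', 6);     % the six groups discussed below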

[Figure: the mean ARI as a function of the number of clusters (2 to 10), for the non-stimulus and the stimulus part.]

Figure 4.4: The mean ARI.

When this dendrogram is cut to create six clusters, the structure is very easy to explain. One finds three clear clusters of algorithms and three extra outliers, i.e., algorithms that are very different from the others.

A first cluster consists of the algorithms KME and SRBF, both of which performed very poorly in the SC analysis. These are the only two algorithms that use the unnormalized data.


Although KME uses a different distance measure, the distance is still computed on the same raw data, where the original amplitudes are still important. This explains the first cluster.

A second cluster is formed by the TICA, SNN, HIER, KMHCN, KMC, FUZC and KMFLP algorithms. This includes almost all the k-means variations that use correlation as similarity measure. SNN and HIER also use the correlation matrix as input. TICA is not really a clustering method, but temporal ICA also uses the correlations between the time series to produce the different independent components.

A third cluster consists of the algorithms that use the normalized data as input: SRBFN, FUZN, KMN and NG. Although these are very different algorithms (spectral clustering, k-means, fuzzy c-means and neural gas), they all belong to the same cluster because they use the normalized data as input.

Note that the Nn-nearest neighbors algorithms produce the same result whether they rank the nearest neighbors with correlation or with the Euclidean distance on the normalized data, so they sit somewhere between these two groups. They are also among the last algorithms to be merged into their group. Likewise, the dendrogram created in hierarchical clustering is the same for correlation as for the Euclidean distance with normalization.

There are also three outliers, algorithms that behave very differently from the other algorithms. These algorithms have an average ARI of at most 0.3 before they get merged into a cluster. A first one is KMFBP, the algorithm that only uses frequencies between 0.1 Hz and 0.2 Hz. Its data is fundamentally different from that of all the other algorithms; it even needs an average ARI as low as 0.0820 before it gets merged. The AR and SICA algorithms also behave differently from the others. Both merge together with the cluster that uses normalized data and with the 'correlation' cluster before they merge with the other algorithms. This is logical because they use correlations and normalized data, respectively.

Looking more closely at the actual ARIs between the algorithms in Figure 4.5, one can see that the KMN and NG algorithms are very similar. The low-pass filtering also has little effect: the KMFLP algorithm gives almost the same result as the KM algorithm. This also means that the neurons could have been measured at a much lower temporal resolution (four times lower) without losing any information. The three clusters and the three outliers are also clearly visible in this figure.

It is interesting to note that the SNN algorithm, which was the best algorithm in the SC comparison, has no single ARI higher than 0.5. Although this algorithm also uses correlations, it is still quite different from all the other algorithms.

The closest algorithm to the HIER algorithm is the KMHCN variation. Both algorithms deal with outliers: KMHCN deletes them before the clustering and HIER throws outlier clusters away.

It is also interesting to take a closer look at the outliers. The KMFBP variation, for example, does not score really badly in the SC comparison, but it gives different results than the other algorithms. Two brain maps are compared in Figure 4.7. A big difference is, for example, that in the KMFBP map the red cluster is symmetric between the left and the right side, which is not the case in the SNN variation. These two clusterings have an ARI of 0.17.


Figure 4.5: The ARI between the algorithms.


[Figure: dendrogram of the clusterings, with leaves ordered KMFBP, KME, SRBF, AR, SMNN, SRBFN, FUZN, KMN, NG, SICA, TICA, SNN, FUZC, KMC, KMFLP, KMHCN, HIER.]

Figure 4.6: A clustering of the algorithms.



Figure 4.7: Two brain maps of the stimulus part with 5 clusters. On the left side the KMFBP algorithm and on the right side the SNN algorithm.

4.3 Conclusion

The conclusion of this chapter is very straightforward. The distance or similarity measures the algorithms use are very important, maybe even more important than the clustering algorithms themselves. The four best performing algorithms all use the correlation measure, which seems to be a much better option than the Euclidean distance. Using Euclidean distances after normalization is also not such a bad idea. It is striking, however, that only few articles from the literature review compare distances or even justify their choice of distance measure. In the single paper in which distances are compared, correlation is also found to be the best option, but only the fuzzy c-means algorithm was considered [25]. Clearly, more thought and effort should go into this fundamental question, especially for the clustering of time series. This does not mean that all algorithms perform exactly the same when they use the same distance measure; different algorithms with the same distance measure are, however, ranked very close to each other. There seems to be no single optimal choice that holds for both parts and for all numbers of clusters.


Chapter 5

Conclusion

5.1 The algorithms

In Table 5.1 the results from the previous chapters are summarized. All results are given only as + and − signs to keep the overview clear.

The KME algorithm is by far the worst algorithm. It takes very long to compute and has the worst results when one looks at the SC.

The best algorithm is probably the SNN variation. It is the best algorithm according to the SC ranking and it has a low computational cost. Moreover, the similarity matrix can be kept very small when the sparse implementation is used, which avoids memory issues. This can be very important for later use. This data set contained only 965 neurons and time series of length 650, but it is very probable that more neurons will be measured in the future, and if the time resolution increases with new technologies, the time series will also become longer. With the sparse similarity matrix a far bigger data set can easily be processed. There is not yet any literature about the clustering of neural time series with this spectral clustering technique. This does not mean that the method is unpopular; on the contrary, a lot has been written about this kind of algorithm in recent years. The algorithm can, however, bring new insights and more accuracy to the clustering of brain time series. Note that all these remarks also hold for the SMNN algorithm, which also scored very well in the SC ranking.
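To make the sparse construction concrete, the MATLAB sketch below builds a nearest-neighbor graph from the correlations and clusters its Laplacian eigenvectors. It is only an illustration of the idea, not the exact implementation used in this thesis: X (neurons-by-time matrix), Nn (number of neighbors) and k (number of clusters) are placeholder names, and the unnormalized graph Laplacian is used for brevity.

Xn = zscore(X, 0, 2);                      % normalize each time series
R  = (Xn * Xn') / size(Xn, 2);             % correlations as dot products
N  = size(X, 1);
R(1:N+1:end) = -Inf;                       % exclude self-similarity
[~, nbrs] = maxk(R, Nn, 2);                % Nn most correlated neighbors per neuron
rows = repmat((1:N)', 1, Nn);
W = sparse(rows(:), nbrs(:), 1, N, N);     % sparse nearest-neighbor graph
W = max(W, W');                            % symmetrize: neighbor of either one
L = spdiags(full(sum(W, 2)), 0, N, N) - W; % unnormalized graph Laplacian
[V, ~] = eigs(L + 1e-9*speye(N), k, 'smallestabs');  % tiny shift keeps the factorization stable
labels = kmeans(V, k, 'Replicates', 50);   % k-means in the low-dimensional eigenvector space

Because W is sparse, only the Nn entries per neuron have to be stored, which is what makes the method attractive for much larger data sets.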

The KM and KMFLP variations scored well in the SC ranking. Their computation is, however, heavier because many replicates are needed for the long time series. In spectral clustering, the k-means step only has to be run on the eigenvectors, which have a length equal to the number of clusters, and is therefore much faster. Note also that a lower resolution would not harm the clustering results, so the KMFLP algorithm is quite a good solution.

Hierarchical clustering is also a good option. It scored quite well in the SC ranking (third place) and is also quite fast. In the early research on the clustering of fMRI data, this algorithm was avoided because the similarity matrix scales so badly. This is still true, but contemporary computers can handle it better. A big advantage is also that one can make dendrograms, with which one can discover the underlying structure of the data.


Two neurons might belong to different clusters, but these two clusters can still be very similar. This can be seen in a dendrogram but would not be visible in a hard clustering like KM or SNN.

The fuzzy c-means algorithm has a comparable advantage. The neurons can belong to several clusters, which is biologically more correct. This can give very interesting information that is not available in hard clustering algorithms. The external validation measure is, however, suited for a hard clustering, which may bias the result of the fuzzy c-means algorithm.

The KMHCN variation has the advantage that it is very fast and can easily be applied to a much larger data set. The final cluster centers are also clearer because they are not clouded by the LCN.

5.2 Distance measure

The most important conclusion, however, concerns the distance measure or the preprocessing technique. In this thesis three main options are discussed. The first is the use of the raw data, which also incorporates the differences in amplitude. These amplitudes are clearly not the most important clustering feature according to the normalized SC. When the RBF similarity measure or the Euclidean distance is used on the raw data, the clustering fails: both algorithms are only able to split the set into an activated and a non-activated cluster, and if more clusters are requested, both fail to produce them.

A second set of algorithms normalizes the data first, so that the amplitudes no longer matter. These algorithms perform significantly better than the algorithms that use the raw data.

A third set of algorithms uses correlation as distance measure. This is clearly the best choice according to the normalized SC. A second important advantage of these correlation methods is that the correlation is much easier to compute than the Euclidean distance: if the data is normalized, the correlation reduces to a dot product, which is computationally very fast. It is striking that so little previous research about clustering brain data discusses the choice of the distance measure or the preprocessing techniques. Most papers simply assume that their method is the best; unfortunately, a lot of that research used the Euclidean distance.
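The relation between the two measures is easy to verify numerically. The MATLAB lines below are a small sketch (x and y are placeholder row vectors of equal length, not data from this thesis) showing that, after normalization, the correlation is a plain dot product and the Euclidean distance is a monotone function of it.

T  = numel(x);
xn = (x - mean(x)) / std(x, 1);   % normalize with the population standard deviation
yn = (y - mean(y)) / std(y, 1);
r  = (xn * yn') / T;              % correlation as a simple dot product
dE = norm(xn - yn);               % Euclidean distance on the normalized series
% dE equals sqrt(2*T*(1 - r)), so ranking neighbors by correlation or by the
% Euclidean distance on normalized data gives exactly the same order.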

Contemporary research on clustering fMRI data often uses hierarchical clustering with correlations. This is already a good start, but as shown in this thesis it can be improved by using spectral clustering. For bigger problems, the sparse implementation would also be a better choice computationally than hierarchical clustering.

5.3 Future work

Future work can go in three directions. A first path is more profound research on the best clustering algorithm: more contemporary (kernel) clustering algorithms can be added and more distance measures can be tested.


Algorithm Group SC Cost Literature Extra

KMC Correlation + + - ?
KME Raw data - - - - +
KMFBP Other - - - ? Interesting other frequencies
KMFLP Correlation + + + ? Lower resolution required
AR Other - - - ?
KMHCN Correlation - + + Unclouded cluster centers
KMN Normalized + - - ?
TICA Correlation - + -
SICA Other - + -
FUZN Normalized - + - No hard clustering
FUZC Correlation + + + + No hard clustering
SRBFN Normalized + + ?
SRBF Raw data - - + ?
SNN Correlation + + + ? Sparse implementation
SMNN Normalized + + + ? Sparse implementation
HIER Correlation + + + + + Dendrogram possible
NG Normalized + + + +

Table 5.1: Summary of the results of the algorithms. The group indicates how the algorithm was clustered. For the SC (see Table 4.2), ranks 1 to 5 get ++, ranks 6 to 10 +, ranks 11 to 15 -, and ranks 16 and 17 --. For the cost, ++ means a computational time of less than 100 s, + between 100 s and 1000 s, - between 1000 s and 10000 s, and -- over 10000 s. For the literature, ++ means consistently good results, + mostly acceptable results, and - mostly bad results in previous research.

Only one contemporary algorithm (spectral clustering) is introduced in this thesis; other contemporary algorithms may perform even better.

A second path belongs more to the area of data mining. The number of neurons that modern functional imaging techniques can measure is expected to keep increasing [1, 55], and in the future it may be possible to use such techniques on a human brain, which contains 15-33 billion neurons [56]. At such scales the usual algorithms in this thesis are of no use; only a sparse implementation like the spectral clustering algorithm may be feasible. Other data mining clustering techniques are probably better suited, though: BIRCH can be used for hierarchical clustering [74], and another well known algorithm for large data sets is the CURE algorithm [28]. For this, one needs to investigate how the algorithms scale with more data. One could generate random data and test the algorithms on such a data set, but the clustering algorithms also depend on the underlying structure (e.g., the stimulus part is almost always clustered faster than the non-stimulus part), so it would be better to test such behavior on real data.


5.3.1 Moving clusters

One option is to cluster the whole time series at once. However, there is a possibility that the clustering changes over time. For example, when stimuli are presented to the zebrafish, the neural circuits might react in another way. Some neurons that were quiet before the stimuli might get activated together and form a new cluster, other clusters can become more compact, and yet others can completely disappear. A third interesting path is to look into this cluster change.

Moving Clusters

A first way of looking into cluster change is to assess moving clusters. From this viewpoint one tries to find clusters that move over time. Li et al. [40] used, next to the location of objects, also their velocity to cluster them, so that objects that move together are clustered together. The use of velocity is, however, rather complicated with time series.

Another way of working with moving clusters is to try to ignore overlapping clusters. When two moving clusters cross paths, Rosswog et al. [59] still try to cluster the objects into their real clusters, without creating one big cluster.

Another way to look at moving clusters is to examine how many objects from cluster $C_{a,i}$ at time $t_i$ are present in a cluster $C_{b,i+1}$ at time $t_{i+1}$. Kalnis et al. described a moving cluster using such a definition [38]. The MONIC (modeling and monitoring cluster transitions) framework uses the same idea but covers far more possible cluster changes [67]. It gives methods to detect cluster splits, merging clusters, newly rising clusters and old dying clusters. The framework also describes internal cluster transitions, in which the internal features of the clusters change; for example, the number of objects in a cluster or the compactness of a cluster can change. Finally, it also studies the lifetime of the clusters that are found. It would be interesting, but probably quite difficult, to research all these transitions in a clear way.
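As a small illustration of this kind of transition detection (a sketch only, not the MONIC implementation), the overlap between two consecutive clusterings can be tabulated and thresholded in a few MATLAB lines. Here labelsT and labelsT1 are assumed to be the label vectors of the same neurons in two consecutive time windows, and the thresholds 0.5 and 0.25 are arbitrary placeholder values.

overlap = accumarray([labelsT(:) labelsT1(:)], 1);  % overlap(a,b): objects shared by cluster a at t_i and cluster b at t_{i+1}
frac = overlap ./ sum(overlap, 2);                  % fraction of each old cluster that ends up in each new cluster
[bestFrac, successor] = max(frac, [], 2);           % most likely successor of every old cluster
survived = bestFrac >= 0.5;                         % old clusters that largely reappear
split    = sum(frac >= 0.25, 2) > 1;                % old clusters whose members scatter over several successors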

There have been two attempts to improve this framework. The first one was proposed by Hawwash et al. [31], who adapted the framework so that it also works for streaming data. A second, much simpler adaptation was made by Oliveira et al. [52], who used bipartite graphs to represent the clustering over time. This way one can study the external or internal transitions, but one can also visualize all the cluster changes in a comprehensible graph.

Evolutionary Clustering

Suppose one can create a clustering based on two features of the objects and that, over time, both are equally accurate. However, on every even time step the clustering with feature A is slightly better, and on every odd time step the clustering with feature B is slightly better. Normally this would result in a clustering that completely changes at every time step. Such a clustering sequence would show strange behavior when analyzed with the unchanged methods from this thesis: at every time step, old clusters would disappear and new ones would arise. Evolutionary clustering tries to reduce


this problem. Chakrabarti et al. [14] define it as 'the problem of processing timestamped data to produce a sequence of clusterings; that is, a clustering for each time step of the system. Each clustering in the sequence should be similar to the clustering at the previous time step, and should accurately reflect the data arriving during that time step'.

One solution may be incremental k-means, but a better option is evolutionary clustering [14, 15]. A first possibility is to use an evolutionary k-means algorithm, which ensures that future cluster centers do not differ too much from previous cluster centers. This may not be such a good option in the case of the neural time series, because cluster centers can change very fast when stimuli are applied. It is, however, also possible to implement evolutionary spectral clustering, which has the option to keep the similarity between one clustering partition and the next high, using a Rand-like index. While the cluster centers, i.e., the brain activation, change, the neurons stay in a similar cluster.
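One very simple way to obtain temporally smooth clusterings, given here only as a hedged sketch and not as the method of [14], is to blend the similarity matrix of the current time window with that of the previous window before clustering; Wt, Wprev, alpha and k are placeholder names.

alpha = 0.8;                                   % weight of the current time window
Wsmooth = alpha * Wt + (1 - alpha) * Wprev;    % temporally smoothed similarity matrix
D = 1 - Wsmooth;
D(1:size(D,1)+1:end) = 0;                      % zero diagonal so squareform accepts it
labels_t = cluster(linkage(squareform(D), 'average'), 'maxclust', k);

Because consecutive windows then share most of their similarity structure, consecutive partitions cannot change abruptly, which is exactly the behavior evolutionary clustering aims for.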


Appendix A

Paper


Comparison of algorithms, distance measures and preprocessing techniques for the clustering of neural data

Merijn Mestdagh
KU Leuven, Department of Electrical Engineering ESAT
Kasteelpark Arenberg 10 postbus 2440, B-3001 Heverlee (Leuven), Belgium
Email: [email protected]

Oscar Mauricio Agudelo
KU Leuven, Department of Electrical Engineering ESAT
Kasteelpark Arenberg 10 postbus 2440, B-3001 Heverlee (Leuven), Belgium
Email: [email protected]

Emre Yaksi
NERF Founded by IMEC, K.U.Leuven and VIB
Kapeldreef 75, B-3001 Leuven, Belgium
E-mail: [email protected]

Abstract—Unsupervised exploratory data analysis has been conducted many times on fMRI and PET data. In this paper, clustering methods are tested on a new kind of neural data. Previous research failed to compare distance measures, preprocessing techniques and algorithms, which are all quite important in the clustering of neural time series. More contemporary algorithms have also not yet been used in similar research. Here the k-means algorithm is used as the basic algorithm. First, multiple variations of this algorithm are tested, including multiple distance measures and preprocessing techniques. Later, other algorithms, like fuzzy c-means clustering, neural gas clustering, hierarchical clustering and spectral clustering, are added to the comparison. Independent component analysis is also tested. All the variations and algorithms are tested on their computational cost and on the spatial connectivity of the resulting clustering. Spectral clustering using nearest neighbors measured with correlations was the optimal choice. In fact, all the algorithms using correlation as similarity measure performed above average on both criteria, the computational cost and the spatial connectivity. It is, however, not recommended to use Euclidean distances on this kind of data without a prior normalization.

I. INTRODUCTION

There is a long tradition of clustering the brain into several parts and studying the activation patterns of those clusters. There are many such studies on PET or fMRI data. Brain imaging techniques have matured, however, and one is now able to record the brain at a single-neuron level [1], [30]. Such new techniques call for a new discussion of the clustering techniques used to analyze this data. The high spatial resolution of these techniques gives the possibility to assess the spatial connectivity of the clustered neurons, which is ideal for an external validation measure.

The data produced by such neurons are time series. Liao [22] describes that choosing a good clustering for time series involves three important choices: one needs to make a decision about the similarity measure, the algorithm and the cluster evaluation criteria. There are far too many possible combinations of such choices to compare them all. A good starting point is the similar literature written about the clustering of fMRI or PET data sets.

The first time exploratory data analysis was performed on this kind of data, principal component analysis (PCA) was used. This method is used to extract the functional patterns out of the data. PCA was first used in 1991 on a PET data set by Moeller et al. [29] and later in 1994 on fMRI data by Sychra et al. [36]. The problem with this method, which tries to explain as much variance as possible, is that most of the variance might not be explained by the task-related process or other interesting processes [26]. Even more, a lot of variance is explained by instrumental or physiological noise [34].

Independent component analysis (ICA) tries to find the original sources from a mixture of sources. The analysis does not work with orthogonal components as PCA does, but assumes that the underlying sources are independent. It also does not try to explain as much variance as possible, as PCA does, and is therefore seen as a better option for this kind of data [34], [26], [7]. Calhoun et al. [8] give a summary of the numerous applications using ICA in combination with fMRI data. They divide ICA into two parts: temporal ICA (TICA) and spatial ICA (SICA). Liu et al. [23] note that, although ICA is used more and more in resting-state fMRI analysis ([21], [37], [7]), there is still no empirical evidence for the assumption of independent sources made by ICA.

Baumgartner et al. [3] were in 1998 among the first to use clustering as an exploratory data analysis on fMRI data. They used a fuzzy c-means clustering algorithm for this purpose, with the Euclidean distance as distance measure, and argued that this was a good option because it is also able to differentiate levels of activation. Also in 1998, Golay et al. [14] compared distance measures for the fuzzy c-means algorithm and found that correlation was a better method than the Euclidean distance to cluster the time series. Somorjai et al. [34] proposed a whole framework to handle the clustering with fuzzy c-means using Euclidean distances. They used a fast fuzzy c-means method which includes a lot of preprocessing: first they normalized the data, and after this they did a preselection to exclude less interesting time series. They excluded a lot of time series to have a faster algorithm; speed seemed to be very important in their framework.


In 1999 Filzmoser et al. [13] were the first to propose a hierarchical clustering algorithm for the clustering of fMRI data. They used Euclidean distances to create the similarity matrix. They could, however, not implement the default agglomerative algorithm because the quadratically growing similarity matrix was impossible to store at that time.

In 2004 a comparison between the cluster analyses of fMRI data was published by Dimitriadou et al. [11]. They compared the methods that were mostly used at that time. Among the non-hierarchical methods the neural gas algorithm performed best, closely followed by the k-means algorithm; the fuzzy c-means algorithm showed not to be a good option. The hierarchical algorithm with Ward linkage had a similar or even better performance, but because this algorithm was computationally heavy Dimitriadou et al. preferred the neural gas algorithm. However, this meta-analysis might be insufficient to draw conclusions, as they only used one kind of distance measure (the Euclidean distance). In the same year, Meyer-Baese et al. [27] also performed a kind of meta-analysis. They compared three ICA methods with three clustering algorithms, using the task-related activation maps with associated time courses and receiver operating characteristics. The biggest advantage of ICA was that it was a faster algorithm; besides that, the clustering algorithms mostly performed better. In particular, again the neural gas algorithm with Euclidean distances gave very good results.

In the last decade, hierarchical clustering has been used more and more for resting-state fMRI analysis. Cordes et al. [9] did this in combination with correlations and single linkage. The time series that had barely any correlation with any other time series were excluded, in order to make the algorithm computationally more attractive. Liu et al. [23] also used hierarchical clustering (this time with average linkage) to analyze resting-state fMRI data. Clusters with fewer than eight time series were excluded because it was found unlikely that they would represent meaningful spatial patterns. Liu et al. used an interesting way to compare and evaluate their results: the aggregation index from He et al. [16], given by the number of shared edges divided by the maximum number of shared edges. It is believed that clusterings with a higher aggregation index are more meaningful than clusterings with spatially randomly distributed time series. This is of course not an exact measure. The results from their hierarchical method were compared with the results from an ICA analysis; the hierarchical method had significantly higher aggregation index values.

There has not yet been any comparative study that includes different distances as well as different algorithms to cluster time series produced by the brain.

II. DATA

The data is collected from the full forebrain of the zebrafish using single-neuron recordings. The data is recorded from living zebrafish that are undergoing multiple food odor stimuli [38]. 965 different neurons are measured, at a temporal resolution of 2 Hz. It is expected that the brain of the zebrafish will react to the stimuli, and therefore the time series of the single neurons will also differ. It is very probable that the clustering of the brain is different in those two parts; therefore the data is split into a no-stimulus and a stimulus part. Each part consists of 650 time steps or 325 seconds.

III. METHODOLOGY

A. Algorithms

1) K-means clustering: The k-means algorithm is probably the best known algorithm. Steinhaus was the first to describe this method in 1956 [35], but it is still very popular after more than 50 years [20]. This is the basic algorithm on which most variations are tested. To keep the paper clear, the other algorithms are only tested with a limited number of variations. All algorithms get an abbreviation so that they can be plotted in figures without writing out their full name. First, three variations that use correlation as similarity measure are tested: one that uses all the neurons without preprocessing (KMC), one that normalizes and then filters the data with a low-pass zero-phase forward Butterworth filter (KMFLP), and one that only uses neurons that have high correlations with other neurons (KMHCN). Of course the widely used squared Euclidean distance is also tested, once on the raw data (KME) and once after normalization of the time series (KMN). It is a well known fact that one has to run the k-means algorithm multiple times with different random starting centers if one wants to approach a global minimum. In this paper 200 replicates were done for every variation; more would still improve the algorithm, but would also increase its duration and make it computationally unattractive. For the KMFLP variation, the cut-off was chosen as low as possible, but such that the peaks of the time series remained visible, since these peaks can be a very important feature for clustering the time series. For the KMHCN variation, the highly correlated neurons were chosen such that a clustering structure became apparent in a multidimensional scaling plot with three coordinates. For all the algorithms, the optimal number of clusters was decided with Dunn's index [12], the Davies-Bouldin index [10], the mean silhouette value [32] and Hubert's normalized Γ [18]. These internal validation measures did, however, not agree on the optimal number of clusters; most of the time the optimal number lay between two and five. The KME variation seems to be less decisive and also suggests a higher optimal number of clusters.
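For concreteness, the k-means variations can be run with MATLAB's kmeans function, as in the sketch below. This is only an illustration under assumptions: X and k are placeholder names for the neurons-by-time matrix and the number of clusters, and the filter order and cut-off used for the KMFLP line are arbitrary example values rather than the tuned ones.

Xn = zscore(X, 0, 2);                                                    % normalized time series
idxKMC = kmeans(X,  k, 'Distance', 'correlation', 'Replicates', 200);   % KMC: correlation on all neurons
idxKME = kmeans(X,  k, 'Distance', 'sqeuclidean', 'Replicates', 200);   % KME: raw data, squared Euclidean
idxKMN = kmeans(Xn, k, 'Distance', 'sqeuclidean', 'Replicates', 200);   % KMN: normalized data
[b, a] = butter(4, 0.25);                                                % example low-pass Butterworth filter
Xf = filtfilt(b, a, Xn')';                                               % zero-phase filtering of each series
idxKMFLP = kmeans(Xf, k, 'Distance', 'correlation', 'Replicates', 200); % KMFLP: filtered data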

2) Hierarchical clustering: A second method is hierarchical clustering (HIER) [15]. In this case agglomerative bottom-up hierarchical clustering is used with average linkage, and the similarity matrix is constructed using the correlations between the neurons. A plain hierarchical clustering with, for example, five clusters in the no-stimulus part resulted in a very bad clustering: one big cluster with almost all the neurons and several smaller clusters consisting of between 1 and 25 outliers. Therefore, clusters with fewer than L neurons were not allowed; clusters with more than L neurons are called real clusters. To decide on the final clustering, a cut-off needs to be chosen. This cut-off is the maximum distance that is allowed for two clusters or neurons to merge. If this cut-off rises, the number of clusters can grow because new real clusters are born out of merged neurons; however, the number of clusters can also decrease again because two real clusters may merge together when the cut-off is increased. For a clustering into K clusters, the cut-off is chosen such that there are K clusters and these K clusters have an as long as


possible lifetime. The lifetime is the difference between the cut-off where the cluster is created and the cut-off where the cluster merges into an older cluster. The neurons that are not clustered in a real cluster for that cut-off are assigned to the nearest real cluster. L is taken as 20 by default but is lowered if more clusters need to be created. For L = 20 there are only four clusters with a significant lifetime, so also here the optimal number of clusters is smaller than five.

3) Spectral clustering: A third method is the more contemporary spectral clustering. This algorithm is inspired by graph theory and has become increasingly popular in the last five years [24]. For this algorithm four variants are tested, which differ only in the way the similarity matrix is created. First there is the Nn-nearest neighbor method (SNN), in which the similarity of two neurons is set to 1 if one of them is a nearest neighbor of the other one. The second variation is the mutual Nn-nearest neighbor method (SMNN), in which the similarity of two neurons is only set to 1 if both are a nearest neighbor of each other. Nn is the number of neighbors that are considered, and correlation is used as similarity measure. The other two variations make use of the radial basis kernel function: one uses the raw data (SRBF) and one uses the normalized data (SRBFN). To decide Nn, the number of nearest neighbors, and the σ needed for the radial basis function, the silhouette value is measured in the feature space (the eigenvectors produced by the algorithm) and compared with the size of the smallest cluster; Nn and σ were chosen to give a good compromise between a high silhouette value and a not too small smallest cluster. A typical method to choose the optimal number of clusters for spectral clustering is to look for a knee in the sorted eigenvalues. The SRBFN and SRBF variants decided against a clustering structure, and the SMNN and SNN algorithms gave no clear answer.

4) Fuzzy c-means clustering: A fourth algorithm is the fuzzy c-means algorithm. All the clustering algorithms seen so far are hard clustering algorithms, meaning that there is always an all-or-nothing decision: each object belongs to one and only one cluster. Another category is fuzzy or soft clustering, in which an object can belong to multiple clusters and has a membership value for each cluster. Such an algorithm might be better suited to realistically describe the clustering situation of the zebrafish: every neuron has more than one function and can be part of multiple functional sections of the brain. The best known fuzzy clustering algorithm is the fuzzy c-means algorithm described by Bezdek [6]. As the name suggests, it closely resembles the k-means algorithm. In this paper it is used with the (squared) Euclidean distance on the normalized data (FUZN) and with correlation as similarity measure (FUZC). The extra parameter, the fuzzifier, is computed with the method of Schwämmle et al. [33]. The number of clusters was assessed with the partition coefficient [5] and the minimum centroid distance [33]. These validation measures preferred three clusters for the stimulus part and three or five clusters for the no-stimulus part for the FUZN variation; for the FUZC algorithm, two or four clusters were better choices.
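A minimal MATLAB sketch of the FUZN variant using the fcm function from the Fuzzy Logic Toolbox is given below; Xn and k are placeholder names, and the toolbox's default fuzzifier is used rather than the tuned value mentioned above.

[centers, U] = fcm(Xn, k);        % U(i, j): membership of neuron j in cluster i
[~, hardLabels] = max(U, [], 1);  % hard assignment, used for the external validation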

5) Neural gas clustering: A fifth and last clustering algorithm is the neural gas algorithm [25]; it uses the Euclidean distance on the normalized data.

This algorithm has not had much attention in the last decade but, as described in the introduction, performed well in some meta-analyses of clustering fMRI data. It is very similar to the k-means algorithm. The extra parameters were chosen according to the standards of Alhoniemi et al. [2]. Using the mean silhouette value, this algorithm chose two as the optimal number of clusters.

6) Independent component analysis: Finally, an independent component analysis (ICA) algorithm is also compared with these clustering algorithms, because it is very common to use such algorithms in the analysis of neural data [19]. In this paper a temporal ICA method is used. The SOBI algorithm [4] was chosen because it performed well in the analysis of Miljkovic et al. [28], where it was used to analyze brain-related (electroencephalography) time series data. This algorithm also uses the correlations between the time series to find the independent components. An ICA algorithm is, however, indecisive about the sign of these independent components. Therefore the correlation of each component is checked with all the neurons: if there are more high negative than high positive correlations, the sign of the independent component is reversed. A high correlation is defined as bigger than 0.4, but this threshold is decreased if no single neuron has such a high correlation with the independent component. The neurons are clustered to the independent component with which they have the highest correlation.
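The sign correction and the assignment to components can be written in a few MATLAB lines, as in the hedged sketch below; S (one component per row) and X (one neuron per row) are placeholder names, and the 0.4 threshold is the one mentioned in the text.

C = corr(S', X');                                % component-by-neuron correlations
for i = 1:size(C, 1)
    if sum(C(i,:) < -0.4) > sum(C(i,:) > 0.4)    % more strong negative than positive links
        S(i,:) = -S(i,:);                        % flip the sign of this component
        C(i,:) = -C(i,:);
    end
end
[~, labels] = max(C, [], 1);                     % assign each neuron to its best-matching component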

B. Distance measures

In the previous paragraphs, multiple algorithms with multiple distance measures are described. It is, however, important to note how closely some of them are related. To normalize a time series, the following equation is applied to each time step $x_t$ of the time series $x$:

$$x_{t,\mathrm{norm}} = \frac{x_t - \bar{x}}{S_x}, \qquad (1)$$

with $S_x$ the standard deviation of the time series and $\bar{x}$ its mean. The result is that the normalized time series has zero mean and a standard deviation of one. The correlation then becomes

$$r(x, y) = \frac{x \cdot y}{T} \qquad (2)$$

and the Euclidean distance becomes

$$d_{\mathrm{Euclidean}}(x, y) = \sqrt{\sum_{t=1}^{T}(x_t - y_t)^2} = \sqrt{\sum_{t=1}^{T}\left(x_t^2 + y_t^2 - 2 x_t y_t\right)} = \sqrt{2T - 2\, x \cdot y} = \sqrt{2T}\sqrt{1 - r(x, y)}. \qquad (3)$$

This means that there is not much difference between the Euclidean distance on the normalized data and the correlation. The Nn-nearest neighbor algorithms get the same result whether they rank the nearest neighbors with correlation or with the Euclidean distance on the normalized data. The hierarchical clustering also obtains the same dendrogram with normalized data as with correlations.


The fundamental difference between the KMC and the KMN variation is that the centroids are not normalized at every iteration in the KMN algorithm. This means that the Euclidean distances from the neurons to the centroids are no longer equivalent to the correlation.

C. An external validation measure

Because the data is gathered at a single-neuron level, the location of all the neurons that produce the time series is also known. It is expected that neurons that are clustered together based on their time series are located close to each other in the brain. This information can be used as an external validation measure. Therefore a spatial coefficient (SC) is constructed to measure such spatial connections. This score is the number of spatially nearest neighbors that are in the same cluster divided by the total number of spatially nearest neighbors Nn. These individual scores do not carry much information by themselves; the spatial coefficient of a cluster is given by the mean of the ratios of all the neurons in that cluster, and the SC of the whole clustering is then the mean of the ratios of all the neurons, or the weighted mean of the SCs of the individual clusters. Nn is chosen to be 5 in this case. This number has to be quite small because spatially connected groups of neurons in the brain can be quite thin: if the number of nearest neighbors were higher, such thin clusters would be neglected, and if it were lower, clusters that are too thin would also get a high SC. The SC is, however, very dependent on the size of the cluster: if a cluster is very big, it is almost impossible to get low ratios. The normalized SC of a cluster with L neurons in a clustering of a total data set of N neurons is given by

$$\mathrm{SC}_{\mathrm{norm},C} = \frac{\mathrm{SC}_C - E_C}{\max(L) - E_C}. \qquad (4)$$

Here $\max(L)$ is given by 1. The expected value can be computed rather easily. With N the total number of neurons, L the number of neurons in the cluster C and Nn the number of nearest neighbors considered, the expected value for cluster C is given by

$$E_C(N_n, N, L) = \sum_{m=0}^{N_n} \frac{m}{N_n}\, P_C(N_n, N, L, m) \qquad (5)$$

with the probability $P_C(N_n, N, L, m)$ given by

$$P_C(N_n, N, L, m) = \binom{N_n}{m}\, \frac{\prod_{i=1}^{m}(L - i)\; \prod_{j=1}^{N_n - m}(N - L - j + 1)}{\prod_{l=1}^{N_n}(N - l)}. \qquad (6)$$

This ensures a value of 0 for random clusterings of varying sizes, while a perfect spatial distribution of the neurons always gets a perfect score of 1, independent of the size of the cluster.
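The following MATLAB sketch computes the normalized SC per cluster from equations (4)-(6). It is only an illustration under assumptions: pos (the matrix of neuron coordinates), labels (a column vector of positive integer cluster labels) and Nn are placeholder names rather than variables from the paper.

[nbrs, ~] = knnsearch(pos, pos, 'K', Nn + 1);   % spatially nearest neighbors, including the neuron itself
nbrs = nbrs(:, 2:end);                          % drop the self-match
ratio = mean(labels(nbrs) == labels(:), 2);     % per-neuron fraction of neighbors in the same cluster
N = numel(labels);
for c = unique(labels)'
    L   = sum(labels == c);
    SCc = mean(ratio(labels == c));             % SC of cluster c
    m   = 0:Nn;                                 % expected ratio for a random cluster of size L, eqs. (5)-(6)
    P   = arrayfun(@(mm) nchoosek(Nn, mm) * prod(L - (1:mm)) * ...
          prod(N - L - (1:(Nn - mm)) + 1) / prod(N - (1:Nn)), m);
    Ec  = sum((m / Nn) .* P);
    fprintf('cluster %d: normalized SC = %.3f\n', c, (SCc - Ec) / (1 - Ec));
end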

IV. RESULTS

A. Comparison of the algorithms

In this section the algorithms are compared to each other with the adjusted Rand index (ARI) [17]. Two objects can agree between two clusterings in two ways: they can be in the same cluster in both clusterings (SS), or they can be in different clusters in both clusterings (DD). M is the number of pairs of objects. The Rand index is then given by [31]

$$R(A, B) = \frac{SS + DD}{M} \qquad (7)$$

The adjusted Rand index is the same, but adjusted for chance. This number is 1 if both clusterings are exactly the same and 0 if they are random with respect to each other. The comparison is thus computed on the basis of how the neurons are clustered by the algorithms. Another possibility would be to compare the cluster centers, but a cluster center does not have a meaning for every algorithm (e.g., in spectral clustering).

All the Rand indices from both parts (stimulus and non-stimulus) and for 2 to 5 clusters can be averaged to create a similarity matrix that represents all the clusterings. Only the scores for these cluster numbers are used, because more clusters would divide neurons that do not really have a clustering structure, and such a clustering is useless anyway. After all this clustering work, the natural next step is to create a new clustering from such a similarity matrix. Hierarchical clustering with average linkage is used, so that the underlying structure of the algorithms can also be explored. This clustering has several advantages. After the clustering of the algorithms, the similarity matrix can be restructured so that it can be shown much more clearly; this is shown in Figure 1. The second, more obvious reason is that it is interesting to see which algorithms belong together. The dendrogram of this clustering is shown in Figure 2.

When this dendrogram is cut to create three clusters, the structure is very easy to explain. A first cluster consists of the algorithms KME and SRBF; these are the only two algorithms that use the unnormalized data. Although KME uses a different distance measure, the distance is still computed on the same raw data, where the original amplitudes are still important. This explains the first cluster. A second cluster is formed by the TICA, FUZC, SNN, HIER, KMHCN, KMC and KMFLP algorithms. This includes almost all the k-means variations that use correlation as similarity measure; SNN and HIER also use the correlation matrix as input. TICA is not really a clustering method, but temporal ICA also uses the correlations between the time series to produce the different independent components. A third cluster consists of the algorithms that use the normalized data as input: SRBFN, FUZN, KMN and NG. Although these are very different algorithms (spectral clustering, k-means, fuzzy c-means and neural gas), they all belong to the same cluster because they use the normalized data as input.

Looking more closely at the actual Rand indices between the algorithms in Figure 1, one can see that the KMN and NG algorithms are very similar. The low-pass filtering also has little effect: the KMFLP algorithm gives almost the same result as the KM algorithm. This also means that the neurons could have been measured at a much lower temporal resolution (four times lower) without losing any information for the clustering. The closest algorithm to the HIER algorithm is the KMHCN variation; both algorithms deal with outliers, KMHCN by deleting them before the clustering and HIER by throwing outlier clusters away. The three clusters of algorithms are also clearly visible in this figure.


Fig. 1. The adjusted Rand indices between the algorithms.

[Figure: dendrogram of the clusterings, with leaves ordered KME, SRBF, SMNN, SRBFN, FUZN, KMN, NG, TICA, SNN, FUZC, KMC, KMFLP, KMHCN, HIER.]

Fig. 2. A clustering of the algorithms.

B. Spatial coefficient

In this section the outputs of the algorithms are tested with the normalized SC. Some clusterings may, however, cheat by creating very small clusters. Dividing the neurons into two groups is quite obvious: this would be the division of the neurons into an active and an inactive part. Finding more clusters can be more difficult, so some algorithms basically just keep the 2 main clusters and add some 'outliers' to a third cluster. With this technique they manage to keep a high normalized SC (almost the same as the normalized SC with 2 clusters), but this would be no good clustering for more clusters. Therefore clusterings that create clusters with fewer than 50 objects are considered bad clusterings and are given a SC of 0.

In Figure 3 the normalized SC of these clusterings is shown, combined with the size of the smallest cluster. Only the results for 2-5 clusters are shown because all the validation measures implied such an optimal number of clusters. Obviously the minimum cluster size of each clustering decreases when the number of clusters increases, but the normalized SC also mostly decreases with a higher number of clusters: it is far easier to divide the data into two than into more clusters. The division into two clusters almost always gives a higher normalized SC, while there is less difference between the other numbers of clusters. It is interesting to see, however, that this is certainly not the case for the FUZN algorithm: its best normalized SC in the stimulus part is found for three clusters. Remember that the internal validation measure of the FUZN algorithm gave one of the clearest answers to the question of the optimal number of clusters, and that this was indeed three for the stimulus part.

It also seems to be easier to find more clusters in the stimulus part than in the non-stimulus part. For example, the TICA algorithm never finds more than two good clusters in the non-stimulus part but finds even five good clusters in the stimulus part. The size of the smallest cluster is mostly lower in the non-stimulus part. This does not mean, however, that the normalized SC of the non-stimulus part is worse than that of the stimulus part: for a lot of algorithms the no-stimulus part yields a higher normalized SC.

There is no algorithm that performs drastically better than the other algorithms, but there are algorithms that perform significantly worse than the others. The SRBF and KME algorithms do not perform very well; they mostly do not find good clusters for higher numbers of clusters. They both have in common that they use the raw data: the data is not normalized and they do not use correlation (which would also implicitly normalize the data).

To rank the algorithms a ranked voting system is used. The algorithms are ranked for each number of clusters (2 to 5) and for each part (stimulus and non-stimulus) by their normalized SC. The best algorithm gets 16 points, the second 14, and the third to the last get 12 down to 1 points. If no real clustering is produced (fewer than 50 neurons in the smallest cluster), 0 points are given. The results are shown in Table I.

The contemporary algorithm, spectral clustering with nearest neighbors, seems to be the best in this ranking. It is, however, followed very closely by the good old k-means (with correlations) algorithm and the hierarchical clustering algorithm. The algorithms that use the non-normalized data are clearly the worst, but the ICA algorithm also performs worse than most clustering algorithms. These results do not, however, say everything: it is expected that clustered neurons lie close to each other in the zebrafish brain, but this is still not a certainty. This ranking probably gives an indication of good and bad, but cannot be seen as an absolute ranking.

It is interesting to note that the SNN algorithm, which is the best algorithm in the SC comparison, has no single Rand index higher than 0.5. Although this algorithm also uses correlations, it is still quite different from all the other algorithms.

C. Cost of the algorithms

In Table I the computational durations of all the algorithms and variations are also shown. The tuning of every algorithm is included in the measurement. One could also measure the duration of one clustering with a specific number of clusters and a specific choice of parameters, but this would give a biased result: it is always important to perform a good tuning, whatever the purpose of the clustering is, so a single clustering without tuning would be rather useless. Everything is measured in MATLAB on an Intel Core i7-3630QM 2.4 GHz CPU laptop with 8 GB RAM.

It is interesting to see that the correlation methods are faster than the methods that use the Euclidean distance.



Fig. 3. The normalized SC for the stimulus and the non-stimulus part combined with the size of the smallest cluster.

TABLE I. SUMMARY AND RANKING OF THE RESULTS.

Algorithm | Group | SCn score | Duration (in seconds) | Literature
SNN | Correlation | 85 | 394 | ?
KMC | Correlation | 80 | 2040 | ?
HIER | Correlation | 80 | 185 | ++
KMFLP | Correlation | 75 | 1557 | ?
SMNN | Correlation | 73 | 518 | ?
FUZC | Correlation | 67 | 26 | +
KMN | Normalized data | 63 | 46472 | ?
FUZN | Normalized data | 63 | 1482 | -
NG | Normalized data | 62 | 526 | ++
SRBFN | Normalized data | 59 | 1327 | ?
KMHCN | Normalized data | 48 | 183 | ?
TICA | Correlation | 35 | 194 | -
SRBF | Euclidean distance | 15 | 1362 | ?
KME | Euclidean distance | 14 | 15679 | +

FUZC is more than 50 times faster than the FUZN algorithm. The reason is that in the correlation algorithms the data is normalized first; after this, the correlation can be computed as a simple dot product (see equation 2), which is very fast compared to the Euclidean distance.

The SNN and SMNN algorithms are also quite a bit faster than the k-means algorithms. The k-means algorithms have the disadvantage that many replicates with random starting centers are needed. The spectral clustering algorithms also use k-means in a later phase, but this clustering happens in the feature space, which is much smaller, and is therefore faster. Another big advantage of the SNN and SMNN algorithms is that they can be implemented with sparse matrices, which makes them suitable for larger data sets. The hierarchical clustering algorithm is a bit faster than the spectral clustering algorithms, but for bigger problems the sparse implementation can be very important: hierarchical clustering algorithms have to store the whole similarity matrix, which scales badly, as it grows quadratically.

V. CONCLUSION

The most important conclusion concerns the distance measure or the preprocessing technique. In this paper three main options are discussed. First there is the use of the raw data, which also incorporates the differences in amplitude.

These amplitudes are clearly not the most important clustering feature according to the normalized SC. When the RBF similarity measure or the Euclidean distance is used on the raw data, the clustering fails: both algorithms are only able to split the set into an activated and a non-activated cluster, and if more clusters are requested, both fail to produce them.

A second set of algorithms normalizes the data first. The amplitudes are not important anymore. These algorithms perform significantly better than the algorithms that use the raw data.

A third set of algorithms uses correlation as distance measure. This is clearly the best choice according to the normalized SC. A second important advantage of these correlation methods is that the correlation is much easier to compute than the Euclidean distance: if the data is normalized, the correlation reduces to a dot product, which is computationally very fast. It is striking that so little previous research discussed the choice of the distance measure or the preprocessing techniques. Mostly it was just assumed that a certain method is the best; unfortunately, a lot of that research used the Euclidean distance.

It is also clear that the clustering algorithms perform better than the ICA algorithm. Only the worst clustering variations, the ones on the raw data, perform worse than the ICA method.

Contemporary research on clustering fMRI data mostly uses hierarchical clustering with correlations. This is already a good start, but as shown in this paper it can be improved in accuracy by using spectral clustering with nearest neighbors. The quadratically scaling distance matrix needed for hierarchical clustering is a further disadvantage that can be avoided when the spectral clustering algorithm with a sparse implementation is used.

May 20, 2013


ACKNOWLEDGMENT

The authors would like to thank Yaksi Lab at IMEC for the production of the data.



KU Leuven Faculteit Ingenieurswetenschappen 2012 – 2013

Fiche masterproef

Student: Merijn Mestdagh
Titel: Clustering neural data
Nederlandse titel: Het clusteren van neurale data
UDC: 51-7
Korte inhoud:
In this thesis, the unsupervised learning of a new kind of data is discussed. This data is created with recently developed techniques that make it possible to measure single-neuron activity at a high spatial and temporal resolution. There is already a history of clustering brain-data time series, but such methods have never been used on this new single-cell resolution data. Previous research also lacks a good comparative study for the clustering of similar time series, one that compares distance measures as well as algorithms and incorporates contemporary knowledge about clustering.

This study tries to find an optimal solution for the clustering of this data. First, the k-means algorithm is used as a baseline to compare against. Many variations of this k-means algorithm are tested. The variations include two different distance measures, correlation and Euclidean distance, and different preprocessing techniques such as filtering, normalization and outlier detection. This k-means algorithm is also compared with other clustering algorithms: the fuzzy c-means algorithm, hierarchical clustering, the neural gas algorithm and the more contemporary spectral clustering. For these algorithms too, different preprocessing techniques and distance measures are tried out. Because independent component analysis techniques are used frequently on this kind of data, clusterings obtained with such techniques are also discussed and compared. For every algorithm, all parameters are carefully tuned, and for all clustering algorithms a suitable internal validity measure is used to decide on the optimal number of clusters.
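
As a hedged illustration of such a procedure (Python with scikit-learn; the silhouette coefficient under a correlation metric stands in here for whichever internal validity measure is actually used, and the data and search range are illustrative), one can sweep the number of clusters and keep the best-scoring value:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: one activity time series per row (n_neurons x n_timepoints).
rng = np.random.default_rng(0)
traces = rng.standard_normal((400, 250))

# Normalized (zero-mean, unit-norm) rows make Euclidean k-means behave like
# correlation-based clustering, since ||a - b||^2 = 2 * (1 - corr(a, b)).
centered = traces - traces.mean(axis=1, keepdims=True)
normalized = centered / np.linalg.norm(centered, axis=1, keepdims=True)

# Internal validity: keep the cluster count with the best silhouette score.
scores = {}
for k in range(2, 11):                       # illustrative search range
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(normalized)
    scores[k] = silhouette_score(normalized, labels, metric="correlation")

best_k = max(scores, key=scores.get)
print("selected number of clusters:", best_k)
```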

The algorithms and variations were compared on three aspects: first, their performance in the previous literature is assessed; second, their computational cost is measured; and third, their performance is evaluated with an external validation measure. The expectation is that neurons that are clustered together based on their time series are also located close to each other in the brain. An advantage of the single-cell imaging techniques is that the exact location of each neuron is known. Using this information, a new coefficient is proposed to measure spatial connectivity.

The results show that the choice of the distance measure is far more important than the choice of the algorithm: the different algorithms produced nowhere near as much variation as the different preprocessing techniques or distance measures did. The best variations according to the spatial connectivity were the spectral clustering algorithm, the k-means clustering algorithm and the hierarchical clustering algorithm, all using correlation as distance measure. There is, however, a larger difference between the algorithms in terms of computational cost. The spectral clustering algorithm can be implemented with sparse matrices, which yields a fast algorithm with a low memory cost. This makes it the algorithm of choice, followed by hierarchical clustering and the k-means algorithm. However, also in the assessment of the computational cost the distance measure is important: variations that use the correlation measure are faster than the others.

Thesis voorgedragen tot het behalen van de graad van Master of Science in de ingenieurswetenschappen: wiskundige ingenieurstechnieken
Promotoren: Prof. dr. ir. B. De Moor
Prof. dr. E. Yaksi
Assessoren: Prof. dr. ir. K. Meerbergen
Prof. dr. ir. W. Michiels
Begeleiders: Dr. ir. O.M. Agudelo
Ir. P. Dreesen
Ir. N. Verbeeck
