## Basic questions in cluster analysis

In an introduction to clustering procedures, it makes sense to focus on methods that assign each subject to only one class. Subjects within a class are usually assumed to be indistinguishable from one another.

We assume that the underlying structure of the data involves an unordered set of discrete classes. They’re all different, and none has more weight than another. In some cases, we may also view these classes as hierarchical in nature, with some classes divided into subclasses.

Clustering procedures can be viewed as “pre-classificatory” in the sense that the researcher has not used prior judgment to partition the subjects (rows of the data matrix). However, it is assumed that some of the objects are heterogeneous; that is, that “clusters” exist.

This presupposition of different groups is based on commonalities within the set of inputs into the algorithm, or clustering variables. This assumption is different from the one made in the case of discriminant analysis or automatic interaction detection, where the dependent variable is used to formally define groups of objects and the distinction is not made on the basis of profile resemblance in the data matrix itself.

Thus, given that no information on group definition is formally evaluated in advance, the key questions of cluster analysis are:

- What measure of inter-subject similarity is to be used and how is each variable to be “weighted” in the construction of such a summary measure?
- After inter-subject similarities are obtained, how are the classes to be formed?
- After the classes have been formed, what summary measures of each cluster are appropriate in a descriptive sense; that is, how are the clusters to be defined?
- Assuming that adequate descriptions of the clusters can be obtained, what inferences can be drawn regarding their statistical significance?
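The first of these questions — choosing and weighting a similarity measure — can be made concrete with a small sketch. The code below uses hypothetical data and standardizes each variable so that all variables contribute equally to a Euclidean distance; rescaling a column by some other factor would weight it differently in the summary measure.

```python
import numpy as np

# Hypothetical data: 4 subjects measured on 3 clustering variables
X = np.array([
    [1.0, 200.0, 3.0],
    [1.2, 180.0, 2.5],
    [5.0,  40.0, 9.0],
    [4.8,  60.0, 8.5],
])

# Standardizing each variable gives it equal weight; multiplying a
# column by a chosen factor instead would weight variables unequally.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Pairwise Euclidean distances between subjects (smaller = more similar)
D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))
print(np.round(D, 2))
```

Here subjects 0 and 1 (and likewise 2 and 3) have similar profiles, so their pairwise distances come out small relative to the rest of the matrix.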

## What about non-scalar data?

So far, we’ve talked about scalar data – data where items differ from each other by degrees along a scale, such as numerical quantity. But what about items that are non-scalar and can only be sorted into categories (as with things like color, species or shape)?

This question is important for applications like survey data analysis, since you’re likely to be dealing with a mix of formats that include both categorical and scalar data.
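One common way to handle such a mix is a Gower-style dissimilarity: scalar variables contribute a range-normalised difference, categorical variables contribute a simple match/mismatch score, and the parts are averaged. The sketch below uses hypothetical survey variables (an age and a colour preference) purely for illustration.

```python
import numpy as np

# Hypothetical survey responses: one scalar variable (age),
# one categorical variable (favourite colour)
ages = np.array([25.0, 30.0, 60.0])
colours = np.array(["red", "red", "blue"])

def mixed_distance(i, j):
    # Gower-style: scalar part is a range-normalised absolute difference,
    # categorical part is a 0/1 mismatch; average the two parts.
    age_range = ages.max() - ages.min()
    d_age = abs(ages[i] - ages[j]) / age_range
    d_colour = 0.0 if colours[i] == colours[j] else 1.0
    return (d_age + d_colour) / 2

print(mixed_distance(0, 1))  # close ages, same colour -> small distance
print(mixed_distance(0, 2))  # far ages, different colour -> large distance
```

The resulting dissimilarities fall between 0 and 1 regardless of each variable’s original units, which is what makes this family of measures convenient for mixed-format survey data.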

## Cluster analysis algorithms

Your choice of cluster analysis algorithm is important, particularly when you have mixed data. In major statistics packages you’ll find a range of preset algorithms ready to number-crunch your matrices. Here are two of the most suitable for cluster analysis.

**K-means** establishes clusters by finding their centroid points. A centroid is the average of all the data points in a cluster. Starting from randomly chosen centroids, the algorithm iteratively assigns each point in the dataset to its nearest centroid (by Euclidean distance), then recomputes the centroids, repeating until the assignments settle. K-means is commonly used in cluster analysis, but it has a limitation in being mainly useful for scalar data.

**K-medoids** works in a similar way to k-means, but rather than using mean centroid points, which don’t correspond to any real points in the dataset, it establishes medoids, which are actual, interpretable data points. K-medoids offers an advantage for survey data analysis because it suits both categorical and scalar data: rather than requiring Euclidean distance between a centroid and its neighbours, the algorithm can work with any dissimilarity measure, including ones that combine a number of different categorical and scalar variables.

In both cases, k is the number of clusters.
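The alternating assign-and-update loop described above can be sketched in a few lines. This is a minimal illustration on made-up points, not a production implementation (library versions add smarter initialisation and convergence checks).

```python
import numpy as np

def k_means(X, k, n_iter=20, seed=0):
    """Minimal k-means sketch: pick k random points as initial centroids,
    then alternate between assigning each point to its nearest centroid
    (Euclidean distance) and recomputing each centroid as the mean of
    its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: centroid = mean of its cluster's points
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two obvious groups of hypothetical points
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels, centroids = k_means(X, k=2)
print(labels)
```

A k-medoids variant would replace the mean in the update step with the cluster member that minimises total dissimilarity to the others, which is why it can accept any distance function, not just Euclidean.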

## Cluster analysis + factor analysis

When you’re dealing with a large number of variables, for example a lengthy or complex survey, it can be useful to simplify your data before performing cluster analysis so that it’s easier to work with. Using factors reduces the number of dimensions that you’re clustering on, and can result in clusters that are more reflective of the true patterns in the data.

Factor analysis is a technique for taking large numbers of variables and combining those that relate to the same underlying factor or concept, so that you end up with a smaller number of dimensions. For example, factor analysis might help you replace questions like “Did you receive good service?” “How confident were you in the agent you spoke to?” and “Did we resolve your query?” with a single factor – customer satisfaction.
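As a rough illustration of collapsing related questions into one dimension, the sketch below uses principal component analysis via SVD as a simple stand-in for a full factor-analysis fit, on hypothetical 1–5 ratings for three related satisfaction questions.

```python
import numpy as np

# Hypothetical responses to three related satisfaction questions (1-5 scale)
X = np.array([
    [5, 5, 4],
    [4, 5, 5],
    [2, 1, 2],
    [1, 2, 1],
    [3, 3, 3],
], dtype=float)

# Centre the data, then extract one component summarising all three
# questions (PCA via SVD stands in here for a factor-analysis fit).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
satisfaction_score = Xc @ Vt[0]  # one combined score per respondent
print(np.round(satisfaction_score, 2))
```

Clustering on this single combined score (or on a handful of such components) rather than on every raw question is what reduces the dimensionality of the subsequent cluster analysis.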

This way you can reduce messiness and complexity in your data and arrive more quickly at a manageable number of clusters.