How are cluster analysis diagrams generated?
This topic explains how the data underlying a cluster analysis diagram is generated.
Measuring similarity
To measure the similarity between each pair of items that will appear in a cluster diagram, NVivo first builds a table where:
- the rows are the sources, nodes or words that will appear in the diagram, and
- the columns and cells depend on which characteristic you’ve chosen to cluster by.
Table rows | Clustered by | Table columns | Table cells |
Sources | Word similarity | Each different word that appears in the text of the sources | The number of times the column’s word appears in the row’s source |
Sources | Coding similarity | Each node that codes the sources’ content | 1 if the column’s node codes the row’s source, 0 otherwise |
Sources | Attribute value similarity | Each different attribute value of the sources (e.g. Book:Year = 2010) | 1 if the row’s source has the column’s attribute value, 0 otherwise |
Nodes | Word similarity | Each different word that appears in the text of the nodes | The number of times the column’s word appears in the row’s node |
Nodes | Coding similarity | Each source coded by the row’s node | 1 if the column’s source is coded by the row’s node, 0 otherwise |
Nodes | Attribute value similarity | Each different attribute value of the nodes (e.g. Person:Sex = Female) | 1 if the row’s node has the column’s attribute value, 0 otherwise |
Words (top 100 words in Word Frequency query results) | N/A | Each source or node that the query searches in | The number of times the row’s word appears in the column’s source or node |
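As an illustration of the first case in the table (sources clustered by word similarity), the following minimal Python sketch builds a word-count table. It is not NVivo's implementation: the function name build_word_table, the sample source names and text, and the whitespace-based, case-insensitive word splitting are all assumptions made for the example.

```python
from collections import Counter

def build_word_table(sources):
    """Return (columns, rows): one column per distinct word, one row of counts per source.

    Assumption: words are split on whitespace and lowercased. NVivo's own
    word handling is not described in this topic and may differ.
    """
    counts = {name: Counter(text.lower().split()) for name, text in sources.items()}
    columns = sorted({word for counter in counts.values() for word in counter})
    rows = {name: [counter[word] for word in columns] for name, counter in counts.items()}
    return columns, rows

# Hypothetical sources, used only to show the shape of the table.
sources = {
    "interview_1": "water quality water supply",
    "interview_2": "water policy",
}
columns, rows = build_word_table(sources)
print(columns)              # ['policy', 'quality', 'supply', 'water']
print(rows["interview_1"])  # [0, 1, 1, 2]
print(rows["interview_2"])  # [1, 0, 0, 1]
```

Each row is then the vector that is compared against every other row when the similarity index is calculated.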
NVivo then calculates a similarity index between each pair of items (each pair of rows in the table) using the similarity metric you’ve selected:
- Pearson correlation coefficient (-1 = least similar, 1 = most similar). For more information, refer to the Wikipedia article Pearson product-moment correlation coefficient.
- Jaccard’s coefficient (0 = least similar, 1 = most similar). For more information, refer to the Wikipedia article Jaccard index.
- Sørensen’s coefficient (0 = least similar, 1 = most similar). For more information, refer to the Wikipedia article Sørensen similarity index.
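The three metrics can be sketched in Python using their standard textbook definitions, applied to two table rows. The function names and sample rows are hypothetical. This topic does not say how Jaccard's and Sørensen's coefficients treat word counts, so the sketch assumes that any non-zero cell counts as "present"; it also assumes a correlation of 0 when a row has no variation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rows (-1 to 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat an undefined correlation as 0
    return cov / sqrt(var_x * var_y)

def jaccard(x, y):
    """Shared non-zero columns divided by columns that are non-zero in either row (0 to 1)."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0

def sorensen(x, y):
    """Twice the shared non-zero columns divided by the non-zero columns in each row (0 to 1)."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    total = sum(1 for a in x if a) + sum(1 for b in y if b)
    return 2 * both / total if total else 0.0

# Hypothetical rows from a coding similarity table (1 = coded, 0 = not coded).
row_a = [1, 1, 0, 1]
row_b = [1, 0, 0, 1]
print(round(pearson(row_a, row_b), 4))   # 0.5774
print(round(jaccard(row_a, row_b), 4))   # 0.6667
print(round(sorensen(row_a, row_b), 4))  # 0.8
```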
Forming clusters
Using the calculated similarity index between each pair of items, NVivo groups the items into a number of clusters (10 by default), using the complete linkage (farthest neighbor) hierarchical clustering algorithm. For more information, refer to the Wikipedia article Complete-linkage clustering.
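The following Python sketch shows the general idea of complete linkage clustering on a similarity matrix. It is a simplified illustration, not NVivo's code: the function name, the sample matrix, and the tie-breaking (first best pair found) are assumptions. Because the input is a similarity rather than a distance, the "farthest neighbor" between two clusters is the pair of items with the lowest similarity.

```python
def complete_linkage_clusters(similarity, k):
    """Merge items until k clusters remain; returns a list of lists of item indexes."""
    clusters = [[i] for i in range(len(similarity))]

    def linkage(a, b):
        # Complete linkage: two clusters are only as similar as their least similar pair.
        return min(similarity[i][j] for i in a for j in b)

    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                score = linkage(clusters[p], clusters[q])
                if best is None or score > best[0]:
                    best = (score, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge the most similar pair of clusters
        del clusters[q]
    return clusters

# Hypothetical similarity indexes for four items.
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(complete_linkage_clusters(similarity, 2))  # [[0, 1], [2, 3]]
```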
Generating a dendrogram
By default the results of the cluster analysis are displayed as a dendrogram, which is generated using the same complete linkage (farthest neighbor) hierarchical clustering technique that is used to form the clusters.
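A dendrogram is essentially a record of the order in which items and clusters are merged. As a rough sketch (again, not NVivo's implementation), the complete linkage procedure can be run until a single cluster remains, storing each merge as a nested tuple together with the similarity at which it happened. The function name, labels, and sample matrix are hypothetical.

```python
def build_dendrogram(similarity, labels):
    """Return a nested tuple (left, right, merge_similarity) describing the merge tree."""
    # Each entry pairs the item indexes in a cluster with the subtree built so far.
    clusters = [([i], labels[i]) for i in range(len(labels))]

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                # Complete linkage: use the least similar pair across the two clusters.
                score = min(similarity[i][j] for i in clusters[p][0] for j in clusters[q][0])
                if best is None or score > best[0]:
                    best = (score, p, q)
        score, p, q = best
        merged_items = clusters[p][0] + clusters[q][0]
        merged_tree = (clusters[p][1], clusters[q][1], score)
        clusters[p] = (merged_items, merged_tree)
        del clusters[q]
    return clusters[0][1]

# Hypothetical similarity indexes for four items labelled A to D.
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
tree = build_dendrogram(similarity, ["A", "B", "C", "D"])
print(tree)  # (('A', 'B', 0.9), ('C', 'D', 0.8), 0.1)
```

In the printed tree, A and B join at a similarity of 0.9, C and D join at 0.8, and the two branches join last at 0.1, which corresponds to the branch heights in a dendrogram.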
Generating a cluster map
The cluster analysis results can also be displayed as a 2D or 3D cluster map, where the items in the cluster analysis are represented as points in space.
The cluster map is generated using an iterative multidimensional scaling algorithm. Initially, the items are placed randomly as data points in a square (2D) or cube (3D), and then a series of iterations is performed to optimize the positions of the items. The optimal distance between each pair of items is defined as 1.1 minus the similarity index between the items. At each iteration, the actual distance between each pair of items is compared to the optimal distance between them, and the data points are moved closer together if they are too far apart, or farther apart if they are too close. The algorithm ends when it reaches a configuration that cannot be improved by further movement of the data points.
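The iteration described above can be approximated with the following 2D Python sketch. It is an illustration under stated assumptions rather than NVivo's algorithm: the function names, the step size, the fixed number of iterations (used here in place of a convergence check), the random seed, and the sum-of-squares stress measure are all hypothetical choices. Only the random starting positions and the optimal distance of 1.1 minus the similarity index come from this topic.

```python
import random
from math import hypot

def target_distances(similarity):
    """Optimal distance for each pair: 1.1 minus the similarity index."""
    n = len(similarity)
    return [[1.1 - similarity[i][j] for j in range(n)] for i in range(n)]

def stress(positions, similarity):
    """Sum of squared gaps between actual and optimal distances (assumed quality measure)."""
    target = target_distances(similarity)
    total = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dist = hypot(positions[i][0] - positions[j][0], positions[i][1] - positions[j][1])
            total += (dist - target[i][j]) ** 2
    return total

def cluster_map_2d(similarity, iterations=500, step=0.05, seed=0):
    """Place items randomly in the unit square, then nudge each pair toward its optimal distance."""
    rng = random.Random(seed)
    n = len(similarity)
    target = target_distances(similarity)
    positions = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iterations):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = positions[j][0] - positions[i][0]
                dy = positions[j][1] - positions[i][1]
                dist = hypot(dx, dy) or 1e-9  # avoid dividing by zero for coincident points
                # Positive error: the pair is too far apart, so item i moves toward item j.
                # Negative error: the pair is too close, so item i moves away from item j.
                error = (dist - target[i][j]) / dist
                positions[i][0] += step * error * dx
                positions[i][1] += step * error * dy
    return positions

# Hypothetical similarity indexes for four items.
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
start = cluster_map_2d(similarity, iterations=0)  # the random starting layout
final = cluster_map_2d(similarity)                # the same layout after optimization
print(stress(start, similarity), stress(final, similarity))
```

Because the same seed is used for both calls, the two stress values compare the starting layout with the optimized layout of the same run. Highly similar items, such as the first two in the sample matrix, end up closer together than dissimilar ones.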