Cluster Filter | |
---|---|

Description |
Find clusters of data points with user defined rules. |

Threaded |
No |

Memory Usage |
Heavy, makes new elements |

Stream Support |
Yes |

Added In |
0.3.0 |

## Cluster Filter

### Basics

This filter is used to find clusters of elements based on user specified rules. There are quite a few different options for this filter, making it rather powerful. Note that it can be quite slow though. There are two primary boolean options that tell this filter how to work.

- Compare clusters? - If this is off, individual data points are always compared to one another. If it is off, clusters are compared and at first each point is its own cluster.
- Use Spatial Tree? - When this is off, every point is compared to every other point. If the grouping has locality in some space, this option can be used so that the comparisons are only done against other elements that are nearby in that space.

The other options include variable declarations and two different formulas. The most important formula is one that says what the condition is for merging. By default this formula has a value of 1=1 so that everything is quickly merged for efficiency. The other formula specifies a search distance and is only used when you have told the filter to use a spatial tree. The figure below shows the first tab of the filter with default values.

At the top of this tab is a button labeled "Setup for Gravity". This will be discussed in detail below. Under that is the option for comparing clusters. Below that you have a list of cluster variables. These are only used if you have selected to compare clusters. Cluster variables have four settings to them, the first is the variable name. After that are two formulas for value and weight. Last is a selection for how to combine the cluster values, whether to simply sum them or to take a weighted average. If you select to sum, the weight should be 1. When you use these variables in formulas, such as the Merge Criteria formula at the bottom, there are always two clusters being considered. For this reason, the variable names that you provided are given a suffix of i and j. This will appear in some of the figures below.

Not at the bottom of this, right above the "Propagate changes" button is the text "Not Processing". Because the cluster filter can take a long time to process, the text here will change during the processing to let you know how far it has gotten.

The second tab for this filter looks like figure below and contains options for the spatial tree as well as general variables. There is a formula for the search distance on the spatial tree. Below this there are two panels for adding variables. Unlike the cluster variables, these variables only have a name and a value.

### Order of evaluation of variables

Most of the settings for this filter are variables declarations where you provide formulas. These formulas can use other variables in some cases so it is very significant that you understand the order in which variables are added and when the formulas are actually evaluated. Note that the cluster variables are only used if the "Compare Clusters" option is turned on. Similarly, the spatial variables are only used if you tell the filter you want to use a spatial tree.

First, the formulas for the spatial variables and the cluster variables are evaluated at the very beginning and can not use any other variables in their definition. These formulas should use standard value and parameter expressions. The values for the spatial variables will never change after their initial evaluation. The values of the cluster variables will as they will become either sums or weighted averages of the subclusters that merge.

During processing the filter stores variables for the spatial values and group values for the "i" element first. It then evaluated the search distance formula. As such, the search distance formula can use either spatial variables or cluster variables with an "i" suffix. After that, the spatial and group variables for the other particle are added with a "j" suffix. Lastly the other variables are added before the merge criteria is checked.

The values for the general variables are added one after another. So the formulas for those variables can only only contain both "i" and "j" version of the spatial and cluster variables, they can also contain any of the general variables above them in the list. The merge criteria can contain any of the above as it comes last.

### Output

Each input stream to this filter will have two output streams associated with it. The first stream has a copy of the input data with one additional parameter saying which cluster that element was part of in the end. The second stream contains elements that have the cluster numbers and the calculated cluster variables. Note that the number of each cluster will be the index of one of the elements that went into it. In general it is not possible to know which one it will be though.

### Simple Example

The figures below show a simple example. The input data is simply a set of random data points in the unit square of the first quadrant. You can see the settings for the first and second tabs as well as a plot that was produced from it.

In this tab not that the x and y variables are weighted averages so they give you the average position of points in each cluster.

The spatial formulas also have an x and a y. Because the cluster variables are added after the spatial ones, the spatial xi, xj, yi, and yj are effective lost and those values are only used for building the tree. In this example the general formula do not use the xi, etc. from the clusters, but instead use normal value expressions. Note that v'[0] refers to the "j" index. Note that the formula for dist uses the variables defined right above it.

This plot shows the clusters that were found. Each cluster has a unique color. The plus signs show the cluster variables for x and y. This is the average position of the different clusters.

### Gravity Settings

A common intended usage of this filter is to find clusters of particles in N-body simulations that are gravitationally bound. Because the settings for this are fairly involved, a button was put on the GUI that will give you a fairly complete structure for doing this. The formulas that it has are designed to be used with the output of a Cartesian and Radius source. That source has position, velocity, and radius, but no mass. If you have data with a mass value you should change this to use it. Because of the lack of a mass, the one thing that must be changed is that you need to add a density to the mass formula under the cluster values. The clustering condition is a negative total energy.

The search radius is a multiple of the hill sphere for units where the mass of the central body is one. Also, the potential energy assumes units where G=1.

This plot shows the largest cluster highlighted.

### Computational Complexity

As was mentioned above, this filter can be very slow. How slow depends a lot on the different settings. With both cluster values and spatial tree off, it is O(n^2) worst case as all pairs of elements must be compared. The spatial tree will typically improve this as smaller search radii will mean that each "i" element will be compared to a smaller number of "j" elements. Cluster values, on the other hand, will typically slow things down to at worst O(n^3). This is because all the work that is done without cluster values is done repeatedly when cluster values are on. At the end of each time, of any merger was done, all the pairs are searched again because the merged cluster might be able to join in ways the unmerged ones did not.