-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added functionality to allow X to be a precomputed distance matrix #55
base: master
Are you sure you want to change the base?
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1260,6 +1260,72 @@ def construct_tree(X, k, prune_threshold=None, num_levels=None, verbose=False): | |
return tree | ||
|
||
|
||
def construct_tree_from_precomputed_matrix(X, p, k, prune_threshold=None, num_levels=None, verbose=False): | ||
""" | ||
Construct a level set tree from tabular data. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "from tabular data" should be updated to "from a precomputed distance matrix." |
||
|
||
Parameters | ||
---------- | ||
X : 2-dimensional numpy array | ||
A square distance matrix | ||
|
||
p: int | ||
Underlying dimensionality of the data that was used to create X, the distance matrix | ||
|
||
k : int | ||
Number of observations to consider as neighbors to a given point. | ||
|
||
prune_threshold : int, optional | ||
Leaf nodes with fewer than this number of members are recursively | ||
merged into larger nodes. If 'None' (the default), then no pruning | ||
is performed. | ||
|
||
num_levels : int, optional | ||
Number of density levels in the constructed tree. If None (default), | ||
`num_levels` is internally set to be the number of rows in `X`. | ||
|
||
verbose : bool, optional | ||
If True, a progress indicator is printed at every 100th level of tree | ||
construction. | ||
|
||
Returns | ||
------- | ||
T : LevelSetTree | ||
A pruned level set tree. | ||
|
||
See Also | ||
-------- | ||
construct_tree, construct_tree_from_graph, LevelSetTree | ||
|
||
Examples | ||
-------- | ||
>>> X = numpy.random.rand(100, 2) | ||
>>> X = scipy.spatial.distance.pdist(data) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I know this is super pedantic, but I prefer not to overwrite the original data. Maybe call this |
||
>>> X = scipy.spatial.distance.squareform(matrix) | ||
>>> tree = debacl.construct_tree_from_precomputed_matrix(X, p=2, k=8, prune_threshold=5) | ||
>>> print(tree) | ||
+----+-------------+-----------+------------+----------+------+--------+----------+ | ||
| id | start_level | end_level | start_mass | end_mass | size | parent | children | | ||
+----+-------------+-----------+------------+----------+------+--------+----------+ | ||
| 0 | 0.000 | 0.870 | 0.000 | 0.450 | 100 | None | [3, 4] | | ||
| 3 | 0.870 | 3.364 | 0.450 | 0.990 | 17 | 0 | [] | | ||
| 4 | 0.870 | 1.027 | 0.450 | 0.520 | 35 | 0 | [7, 8] | | ||
| 7 | 1.027 | 1.755 | 0.520 | 0.870 | 8 | 4 | [] | | ||
| 8 | 1.027 | 3.392 | 0.520 | 1.000 | 23 | 4 | [] | | ||
+----+-------------+-----------+------------+----------+------+--------+----------+ | ||
""" | ||
|
||
sim_graph, radii = _utl.knn_graph(X, k, method='precomputed') | ||
|
||
n, _ = X.shape | ||
density = _utl.knn_density(radii, n, p, k) | ||
|
||
tree = construct_tree_from_graph(adjacency_list=sim_graph, density=density, | ||
prune_threshold=prune_threshold, | ||
num_levels=num_levels, verbose=verbose) | ||
return tree | ||
|
||
|
||
def construct_tree_from_graph(adjacency_list, density, prune_threshold=None, | ||
num_levels=None, verbose=False): | ||
""" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -38,12 +38,13 @@ def knn_graph(X, k, method='brute_force', leaf_size=30): | |
Parameters | ||
---------- | ||
X : numpy array | list [numpy arrays] | ||
Data points, with each row as an observation. | ||
Data points, with each row as an observation | ||
OR a distance matrix if method='precomputed' | ||
|
||
k : int | ||
The number of points to consider as neighbors of any given observation. | ||
|
||
method : {'brute-force', 'kd-tree', 'ball-tree'}, optional | ||
method : {'brute-force', 'kd-tree', 'ball-tree', 'precomputed'}, optional | ||
Computing method. | ||
|
||
- 'brute-force': computes the (Euclidean) distance between all O(n^2) | ||
|
@@ -61,11 +62,14 @@ def knn_graph(X, k, method='brute_force', leaf_size=30): | |
distances. Typically much faster than 'brute-force', and works with | ||
up to a few hundred dimensions. Requires the scikit-learn library. | ||
|
||
- 'precomputed': used when the matrix X is a precomputed distance | ||
matrix. | ||
|
||
leaf_size : int, optional | ||
For the 'kd-tree' and 'ball-tree' methods, the number of observations | ||
in the leaf nodes. Leaves are not split further, so distance | ||
computations within leaf nodes are done by brute force. 'leaf_size' is | ||
ignored for the 'brute-force' method. | ||
ignored for the 'brute-force' and 'precomputed' methods. | ||
|
||
Returns | ||
------- | ||
|
@@ -109,6 +113,19 @@ def knn_graph(X, k, method='brute_force', leaf_size=30): | |
raise ImportError("The scikit-learn library could not be loaded." + | ||
" It is required for the 'ball-tree' method.") | ||
|
||
if method == 'precomputed': | ||
# Pseudo-sort each row of the distance matrix so that the indices of the closest k elements are listed first | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should probably validate here that input |
||
# The entries [...,:k] are not in sorted order by distance | ||
neighbors = _np.argpartition(X, k)[:, 0:k] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good idea - I didn't know numpy had this! On a 1,000 x 1,000 matrix, my crude ipython timing says this is 4x faster than argsort. |
||
|
||
# Get the distance values for the first k elements, and then sort 'neighbors' by those distance values | ||
radii = X[_np.arange(n)[:, None], neighbors[:]] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If I'm reading the documentation for
|
||
order = radii.argsort(axis=1) | ||
neighbors = neighbors[_np.arange(n)[:, None], order] | ||
|
||
radii = radii[_np.arange(n)[:, None], order] | ||
radii = radii[:, -1] | ||
|
||
else: # assume brute-force | ||
if not _HAS_SCIPY: | ||
raise ImportError("The 'scipy' module could not be loaded. " + | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change "precomputed_matrix" to "distance_matrix"? I think it would be a little more explicit.