[WIP] Tidy-up and improve ergonomics with new interface and dataset #45

Closed
wants to merge 37 commits into from

Commits (37)
f885f1d
Introduce traits and dataset
bytesnake Sep 16, 2020
3c117bf
Merge remote-tracking branch 'upstream/master' into traits
bytesnake Oct 12, 2020
271d1cc
Rename `fit_update` to `fit_with`
bytesnake Oct 12, 2020
aae00c0
Add `get_label` to dataset struct
bytesnake Oct 12, 2020
f29efb2
Every dataset can also act as data alone
bytesnake Oct 12, 2020
b4605b0
Rename data to records to disambiguate from ndarray
bytesnake Oct 12, 2020
4c3f033
Port `linfa-kernel` to new syntax
bytesnake Oct 12, 2020
e4c3df2
Start porting SVM to new architecture
bytesnake Oct 12, 2020
d7e2528
Introduce phantom type to SVM
bytesnake Oct 13, 2020
35173f4
Fit with different targets
bytesnake Oct 14, 2020
071c94e
Add Fit to SVRegression
bytesnake Oct 15, 2020
6f3231f
Implement ConfusionMatrix for dataset struct
bytesnake Oct 15, 2020
e0cbff5
Port BinaryClassification to dataset struct
bytesnake Oct 15, 2020
92794ee
First working example for SVM
bytesnake Oct 15, 2020
a0541e6
Run autoformatting
bytesnake Oct 15, 2020
b9a5218
Move to new API for tests in `linfa-svm`
bytesnake Oct 16, 2020
5f689e8
Move wine quality example to linfa dataset
bytesnake Oct 17, 2020
549bf8d
Add support vector regression tests
bytesnake Oct 17, 2020
c5b40f6
Implement transformer for hierarchical clustering
bytesnake Oct 18, 2020
e10ddd5
Add option to choose ndarray backend for `linfa-ica`
bytesnake Oct 18, 2020
0167887
Move `linfa-ica` to new traits
bytesnake Oct 18, 2020
0e48991
Implement new traits for KMeans
bytesnake Oct 18, 2020
7f06b36
Implement transformer for DBSCAN
bytesnake Oct 18, 2020
46f5e2e
Move PCA and diffusion maps to new traits
bytesnake Oct 18, 2020
0e3ba8f
Add prelude to linfa
bytesnake Oct 19, 2020
a060c85
Fix tests of classification metrics
bytesnake Oct 21, 2020
dd504fb
Add text how to contribute
bytesnake Oct 21, 2020
eb7d5d3
Remove associated type from `Labels`
bytesnake Oct 25, 2020
ae940bf
Remove labels from Dataset
bytesnake Oct 25, 2020
e710d51
Run fmt and remove serde dependency in reduction
bytesnake Oct 25, 2020
430faf4
Make serde optional in clustering
bytesnake Oct 25, 2020
6e5c608
Add section on datasets to contribute document
bytesnake Oct 25, 2020
93ac0aa
Add section on serde feature
bytesnake Oct 25, 2020
a9e1161
Add one-vs-all function
bytesnake Oct 25, 2020
7b0591d
Add error type
bytesnake Oct 25, 2020
58113a1
Add error type for parameters and ndarray
bytesnake Oct 25, 2020
fdf2411
Add section on builder patterns
bytesnake Nov 2, 2020
134 changes: 134 additions & 0 deletions CONTRIBUTE.md
@@ -0,0 +1,134 @@
# How to contribute to the Linfa project

This document serves as a reference when contributing to Linfa. It describes how an algorithm should be implemented to fit well into the Linfa ecosystem. It first covers implementation details, such as how to use a generic float type and how to take the `Dataset` type as an argument, and then explains how the cargo manifest should be set up so that a user can choose between different backends.

## Datasets and learning traits

An important part of the Linfa ecosystem is how data is organized for the training and estimation process. A [Dataset](src/dataset/mod.rs) serves this purpose. It is a small wrapper around data and target types and should be used as the argument to the [Fit](src/traits.rs) trait. It is generically parametrized, with [Records](src/dataset/mod.rs) representing the input data (at the moment only implemented for `ndarray::ArrayBase`) and [Targets](src/dataset/mod.rs) representing the targets.
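
As a rough illustration of how records and targets come together, here is a small sketch; note that the `Dataset::new(records, targets)` constructor is an assumption, so check [src/dataset/mod.rs](src/dataset/mod.rs) for the exact API:
```rust
use linfa::Dataset;
use ndarray::{array, Array2};

fn build_dataset() -> Dataset<Array2<f64>, Vec<bool>> {
    // Two observations with two features each, plus one boolean target per observation.
    let records = array![[1.0, 2.0], [3.0, 4.0]];
    let targets = vec![true, false];

    // Hypothetical constructor; the resulting dataset is then handed to `Fit`,
    // `Predict` or `Transformer` implementations.
    Dataset::new(records, targets)
}
```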

You can find traits for the different classes of algorithms [here](src/traits.rs). For example, to implement a fittable algorithm which takes a `Kernel` as input data and a boolean array as targets:
```rust
impl<'a, F: Float> Fit<'a, Kernel<'a, F>, Vec<bool>> for SvmParams<F, Pr> {
    type Object = Svm<'a, F, Pr>;

    fn fit(&self, dataset: &'a Dataset<Kernel<'a, F>, Vec<bool>>) -> Self::Object {
        ...
    }
}
```
The dataset type is `&'a Dataset<Kernel<'a, F>, Vec<bool>>`, which ensures that the kernel lives long enough during training. Fitting produces a trained state, called `Svm<'a, F, Pr>`, with probability type `Pr`.

The [Predict](src/traits.rs) trait should be implemented for dataset arguments as well as for plain arrays. If a dataset is provided, then `predict` takes ownership of it and returns a new dataset with the predicted targets. For an array, `predict` takes a reference and returns the predicted targets. In the same context, SVM implements predict like this:
```rust
impl<'a, F: Float, T: Targets> Predict<Dataset<Array2<F>, T>, Dataset<Array2<F>, Vec<Pr>>>
    for Svm<'a, F, Pr>
{
    fn predict(&self, data: Dataset<Array2<F>, T>) -> Dataset<Array2<F>, Vec<Pr>> {
        ...
    }
}
```
and
```rust
impl<'a, F: Float, D: Data<Elem = F>> Predict<ArrayBase<D, Ix2>, Vec<Pr>> for Svm<'a, F, Pr> {
    fn predict(&self, data: ArrayBase<D, Ix2>) -> Vec<Pr> {
        ...
    }
}
```

For an example of a `Transformer`, please look at the [linfa-kernel](linfa-kernel/src/lib.rs) implementation.
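
To give an idea of the overall shape of such an implementation, here is a purely illustrative sketch; it assumes the `Transformer` trait from [src/traits.rs](src/traits.rs) exposes a single `transform` method parameterized over input and output types:
```rust
use linfa::traits::Transformer;
use ndarray::Array2;

// Illustrative transformer without learned state: scales every record by a constant factor.
pub struct Scale(pub f64);

impl Transformer<Array2<f64>, Array2<f64>> for Scale {
    fn transform(&self, records: Array2<f64>) -> Array2<f64> {
        records * self.0
    }
}
```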

## Parameters and builder

An algorithm has a number of hyperparameters, describing how it operates. This section describes how the algorithm's structs should be organized in order to conform with other implementations.

Imagine we have an implementation called `MyAlg`; there should then be a separate struct called `MyAlgParams`. The method `MyAlg::params(..) -> MyAlgParams` constructs a parameter set with default values and, optionally, any required arguments (for example the number of clusters). If no arguments are required, then `std::default::Default` can be implemented as well:
```rust
impl Default for MyAlgParams {
    fn default() -> MyAlgParams {
        MyAlg::params()
    }
}
```

`MyAlgParams` should implement the consuming builder pattern, explained in the [Rust Book](https://doc.rust-lang.org/1.0.0/style/ownership/builders.html). Each hyperparameter gets a field in the struct, as well as a method to modify it. If a random number generator is used in the training process, two separate methods should be provided: one taking a seed and one taking a random number generator. From the seed, a default RNG such as [Isaac64](https://docs.rs/rand_isaac/0.2.0/rand_isaac/isaac64/index.html) is initialized.
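
A minimal sketch of such a parameter struct is shown below; the names `MyAlgParams`, `eps` and `with_rng` are illustrative only:
```rust
use rand::{Rng, SeedableRng};
use rand_isaac::Isaac64Rng;

pub struct MyAlgParams<R: Rng> {
    eps: f64,
    rng: R,
}

impl MyAlgParams<Isaac64Rng> {
    /// Construct the parameter set from a seed by initializing the default RNG.
    pub fn new(seed: u64) -> Self {
        Self {
            eps: 1e-5,
            rng: Isaac64Rng::seed_from_u64(seed),
        }
    }
}

impl<R: Rng> MyAlgParams<R> {
    /// Consuming setter: takes `self` by value and hands it back for chaining.
    pub fn eps(mut self, eps: f64) -> Self {
        self.eps = eps;
        self
    }

    /// Swap in a user-provided random number generator.
    pub fn with_rng<R2: Rng>(self, rng: R2) -> MyAlgParams<R2> {
        MyAlgParams { eps: self.eps, rng }
    }
}
```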

With a constructed set of parameters, `MyAlgParams::fit(..) -> Result<MyAlg>` executes the learning process and returns the learned state. If one of the parameters is invalid (for example, outside a required range), then an `Error::InvalidState` should be returned. For transformers there is only `MyAlg` and no `MyAlgParams`, because there is no hidden state to be learned.

Following this convention, the pattern is used like this:
```rust
MyAlg::params()
    .eps(1e-5)
    .backwards(true)
    ...
    .fit(&dataset)?;
```
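
How `fit` might guard against an invalid parameter is sketched below; the concrete error type is an assumption here and should be replaced by the error enum actually defined in the crate:
```rust
// Sketch only: reject an out-of-range tolerance before training starts.
// In Linfa this would map to a variant such as `Error::InvalidState`.
fn check_params(eps: f64) -> Result<(), String> {
    if eps <= 0.0 {
        return Err(format!("tolerance must be positive, got {}", eps));
    }
    Ok(())
}
```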

## Let the user choose their favourite linear algebra library

One important decision for a user of Linfa is the linear algebra backend. The `ndarray-linalg` library currently supports `openblas`, `netlib` and `intel-mkl` as backends. It is considered good practice to let the user choose which backend they want to use. Let's say you're using the `linfa-kernel` subcrate and have also added `ndarray-linalg` as a dependency; then your cargo manifest should add the corresponding features:

```
[features]
default = []
openblas = ["ndarray-linalg/openblas", "linfa-kernel/openblas"]
intel-mkl = ["ndarray-linalg/intel-mkl", "linfa-kernel/intel-mkl"]
netlib = ["ndarray-linalg/netlib", "linfa-kernel/netlib"]

[dependencies]
ndarray = { version = "0.13", default-features=false, features=["blas"] }
ndarray-linalg = { version = "0.12" }
linfa-kernel = { path = "../linfa-kernel" }
...

```
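
With this set-up the backend is selected at build time, for example with `cargo build --features openblas`, while the default feature set pulls in no BLAS backend at all.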

## Use a specific backend for testing

When you're implementing tests that rely on `ndarray-linalg`, you have to add the `openblas-src` crate. This instructs cargo to compile the backend so that the required symbols can be found. Your cargo manifest should include the dependency:
```
[dev-dependencies]
...
openblas-src = "0.9"
```
and you have to add an `extern crate openblas_src;` statement to the `tests` module.
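
A minimal sketch of what this looks like in a test file (the file name and test body are placeholders):
```rust
// tests/algorithm.rs
// Link the BLAS backend so that the symbols required by `ndarray-linalg` resolve.
extern crate openblas_src;

#[test]
fn smoke_test() {
    // exercise code paths that go through ndarray-linalg here
}
```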

## Generic float types

Every algorithm should be implemented for the `f32` and `f64` floating point types. This can be achieved with the `linfa::Float` trait, which is basically just a combination of `ndarray::NdFloat` and `num_traits::Float`. You can look up most of the constants (like zero, one, PI) in the `num_traits` documentation. Here is a small example of a function that is generic over `Float`:
```rust
use linfa::Float;
fn div_capped<F: Float>(num: F) -> F {
    F::one() / (num + F::from(1e-5).unwrap())
}
```
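
The same function then works with either precision, for example (inside any function body):
```rust
let x32 = div_capped(10f32); // monomorphized for f32
let x64 = div_capped(10f64); // monomorphized for f64
```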

## Make serde optional

If you want to implement `Serialize` and `Deserialize` for your parameters, please do so behind a feature flag. You can add the following to your cargo manifest:
```
[features]
serde = ["serde_crate", "ndarray/serde"]

[dependencies.serde_crate]
package = "serde"
optional = true
version = "1.0"
```
This renames the `serde` crate to `serde_crate` and adds a `serde` feature. In your parameter struct, put the derive macros behind the `serde` feature:
```rust
#[cfg(feature = "serde")]
use serde_crate::{Deserialize, Serialize};

#[cfg_attr(
    feature = "serde",
    derive(Serialize, Deserialize),
    serde(crate = "serde_crate")
)]
#[derive(Clone, Debug, PartialEq)]
pub struct HyperParams {
    ...
}
```
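
Downstream crates can then opt in by enabling the feature in their own manifest, for example `linfa-clustering = { version = "0.1", features = ["serde"] }`.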

17 changes: 1 addition & 16 deletions Cargo.toml
@@ -13,24 +13,9 @@ keywords = ["machine-learning", "linfa", "ai", "ml"]
categories = ["algorithms", "mathematics", "science"]

[dependencies]
ndarray = { version = "0.13", default-features = false }
ndarray = "0.13"
num-traits = "0.1.32"

linfa-clustering = { path = "linfa-clustering", version = "0.1" }
linfa-trees = { path = "linfa-trees", version = "0.1" }
linfa-reduction = { path = "linfa-reduction", version = "0.1" }
linfa-linear = { path = "linfa-linear", version = "0.1" }
linfa-logistic = { path = "linfa-logistic", version = "0.1" }
linfa-svm = { path = "linfa-svm", version = "0.1" }
linfa-hierarchical = { path = "linfa-hierarchical", version = "0.1" }

[dev-dependencies]
rand = "0.7"
ndarray-rand = "0.11"
rand_isaac = "0.2.0"
ndarray-npy = { version = "0.5", default-features = false }
approx = "0.3"

[workspace]
members = [
"linfa-clustering",
24 changes: 19 additions & 5 deletions linfa-clustering/Cargo.toml
@@ -12,18 +12,32 @@ readme = "README.md"
keywords = ["clustering", "machine-learning", "linfa", "k-means", "unsupervised"]
categories = ["algorithms", "mathematics", "science"]

[features]
default = []
openblas = ["ndarray-linalg/openblas"]
intel-mkl = ["ndarray-linalg/intel-mkl"]
netlib = ["ndarray-linalg/netlib"]
serde = ["serde_crate", "ndarray/serde"]

[dependencies.serde_crate]
package = "serde"
optional = true
version = "1.0"
default-features = false
features = ["std", "derive"]

[dependencies]
ndarray = { version = "0.13" , features = ["rayon", "serde", "approx"]}
ndarray = { version = "0.13" , features = ["rayon", "approx"]}
ndarray-rand = "0.11"
ndarray-stats = "0.3"
ndarray-linalg = { version = "0.12", features = ["openblas"] }
ndarray-linalg = { version = "0.12" }
sprs = "0.7"
serde = { version = "1", features = ["derive"] }
num-traits = "0.1.32"
linfa-kernel = { path = "../linfa-kernel" }
rand_isaac = "0.2.0"
linfa = { path = ".." }

[dev-dependencies]
rand_isaac = "0.2.0"
openblas-src = "0.9"
ndarray-npy = { version = "0.5", default-features = false }
criterion = "0.3"
serde_json = "1"
10 changes: 6 additions & 4 deletions linfa-clustering/examples/dbscan.rs
@@ -1,4 +1,5 @@
use linfa_clustering::{generate_blobs, Dbscan, DbscanHyperParams};
use linfa::traits::Transformer;
use linfa_clustering::{generate_blobs, Dbscan};
use ndarray::array;
use ndarray_npy::write_npy;
use ndarray_rand::rand::SeedableRng;
@@ -17,10 +18,11 @@ fn main() {

// Configure our training algorithm
let min_points = 3;
let hyperparams = DbscanHyperParams::new(min_points).tolerance(1e-5).build();

// Infer an optimal set of centroids based on the training data distribution
let cluster_memberships = Dbscan::predict(&hyperparams, &dataset);
let cluster_memberships = Dbscan::params(min_points)
.tolerance(1e-5)
.build()
.transform(&dataset);

// Save to disk our dataset (and the cluster label assigned to each observation)
// We use the `npy` format for compatibility with NumPy
28 changes: 14 additions & 14 deletions linfa-clustering/examples/kmeans.rs
@@ -1,4 +1,6 @@
use linfa_clustering::{generate_blobs, KMeans, KMeansHyperParams};
use linfa::traits::{Fit, Predict};
use linfa::Dataset;
use linfa_clustering::{generate_blobs, KMeans};
use ndarray::{array, Axis};
use ndarray_npy::write_npy;
use ndarray_rand::rand::SeedableRng;
@@ -13,27 +15,25 @@ fn main() {
// For each our expected centroids, generate `n` data points around it (a "blob")
let expected_centroids = array![[10., 10.], [1., 12.], [20., 30.], [-20., 30.],];
let n = 10000;
let dataset = generate_blobs(n, &expected_centroids, &mut rng);
let dataset = Dataset::from(generate_blobs(n, &expected_centroids, &mut rng));

// Configure our training algorithm
let n_clusters = expected_centroids.len_of(Axis(0));
let hyperparams = KMeansHyperParams::new(n_clusters)
let model = KMeans::params_with_rng(n_clusters, rng)
.max_n_iterations(200)
.tolerance(1e-5)
.build();

// Infer an optimal set of centroids based on the training data distribution
let model = KMeans::fit(hyperparams, &dataset, &mut rng);
.build()
.fit(&dataset);

// Assign each point to a cluster using the set of centroids found using `fit`
let cluster_memberships = model.predict(&dataset);
let dataset = model.predict(dataset);
let Dataset {
records, targets, ..
} = dataset;

// Save to disk our dataset (and the cluster label assigned to each observation)
// We use the `npy` format for compatibility with NumPy
write_npy("clustered_dataset.npy", dataset).expect("Failed to write .npy file");
write_npy(
"clustered_memberships.npy",
cluster_memberships.map(|&x| x as u64),
)
.expect("Failed to write .npy file");
write_npy("clustered_dataset.npy", records).expect("Failed to write .npy file");
write_npy("clustered_memberships.npy", targets.map(|&x| x as u64))
.expect("Failed to write .npy file");
}