[WIP] Tidy-up and improve ergonomics with new interface and dataset #45

Closed
wants to merge 37 commits into from

Commits (37)
f885f1d
Introduce traits and dataset
bytesnake Sep 16, 2020
3c117bf
Merge remote-tracking branch 'upstream/master' into traits
bytesnake Oct 12, 2020
271d1cc
Rename `fit_update` to `fit_with`
bytesnake Oct 12, 2020
aae00c0
Add `get_label` to dataset struct
bytesnake Oct 12, 2020
f29efb2
Every dataset can also act as data alone
bytesnake Oct 12, 2020
b4605b0
Rename data to records to disambiguate from ndarray
bytesnake Oct 12, 2020
4c3f033
Port `linfa-kernel` to new syntax
bytesnake Oct 12, 2020
e4c3df2
Start porting SVM to new architecture
bytesnake Oct 12, 2020
d7e2528
Introduce phantom type to SVM
bytesnake Oct 13, 2020
35173f4
Fit with different targets
bytesnake Oct 14, 2020
071c94e
Add Fit to SVRegression
bytesnake Oct 15, 2020
6f3231f
Implement ConfusionMatrix for dataset struct
bytesnake Oct 15, 2020
e0cbff5
Port BinaryClassification to dataset struct
bytesnake Oct 15, 2020
92794ee
First working example for SVM
bytesnake Oct 15, 2020
a0541e6
Run autoformatting
bytesnake Oct 15, 2020
b9a5218
Move to new API for tests in `linfa-svm`
bytesnake Oct 16, 2020
5f689e8
Move wine quality example to linfa dataset
bytesnake Oct 17, 2020
549bf8d
Add support vector regression tests
bytesnake Oct 17, 2020
c5b40f6
Implement transformer for hierarchical clustering
bytesnake Oct 18, 2020
e10ddd5
Add option to choose ndarray backend for `linfa-ica`
bytesnake Oct 18, 2020
0167887
Move `linfa-ica` to new traits
bytesnake Oct 18, 2020
0e48991
Implement new traits for KMeans
bytesnake Oct 18, 2020
7f06b36
Implement transformer for DBSCAN
bytesnake Oct 18, 2020
46f5e2e
Move PCA and diffusion maps to new traits
bytesnake Oct 18, 2020
0e3ba8f
Add prelude to linfa
bytesnake Oct 19, 2020
a060c85
Fix tests of classification metrics
bytesnake Oct 21, 2020
dd504fb
Add text how to contribute
bytesnake Oct 21, 2020
eb7d5d3
Remove associated type from `Labels`
bytesnake Oct 25, 2020
ae940bf
Remove labels from Dataset
bytesnake Oct 25, 2020
e710d51
Run fmt and remove serde dependency in reduction
bytesnake Oct 25, 2020
430faf4
Make serde optional in clustering
bytesnake Oct 25, 2020
6e5c608
Add section on datasets to contribute document
bytesnake Oct 25, 2020
93ac0aa
Add section on serde feature
bytesnake Oct 25, 2020
a9e1161
Add one-vs-all function
bytesnake Oct 25, 2020
7b0591d
Add error type
bytesnake Oct 25, 2020
58113a1
Add error type for parameters and ndarray
bytesnake Oct 25, 2020
fdf2411
Add section on builder patterns
bytesnake Nov 2, 2020
134 changes: 134 additions & 0 deletions CONTRIBUTE.md
@@ -0,0 +1,134 @@
# How to contribute to the Linfa project

This document serves as a reference when contributing to Linfa. It describes how an algorithm should be implemented to fit well into the Linfa ecosystem. It first covers implementation details, such as how to use a generic float type and how to take the `Dataset` type as an argument, and then explains how the cargo manifest should be set up so that a user can choose between different backends.

## Datasets and learning traits

An important part of the Linfa ecosystem is how data is organized for the training and estimation process. A [Dataset](src/dataset/mod.rs) serves this purpose. It is a small wrapper around data and target types and should be used as the argument to the [Fit](src/traits.rs) trait. It is generically parametrized, with [Records](src/dataset/mod.rs) representing the input data (at the moment only implemented for `ndarray::ArrayBase`) and [Targets](src/dataset/mod.rs) representing the targets.
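
As a rough illustration of how records and targets come together, here is a small sketch; note that the `Dataset::new(records, targets)` constructor is an assumption, so check [src/dataset/mod.rs](src/dataset/mod.rs) for the exact API:
```rust
use linfa::Dataset;
use ndarray::{array, Array2};

fn build_dataset() -> Dataset<Array2<f64>, Vec<bool>> {
    // Two observations with two features each, plus one boolean target per observation.
    let records = array![[1.0, 2.0], [3.0, 4.0]];
    let targets = vec![true, false];

    // Hypothetical constructor; the resulting dataset is then handed to `Fit`,
    // `Predict` or `Transformer` implementations.
    Dataset::new(records, targets)
}
```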

You can find traits for the different classes of algorithms [here](src/traits.rs). For example, to implement a fittable algorithm which takes a `Kernel` as input data and a boolean array as targets:
```rust
impl<'a, F: Float> Fit<'a, Kernel<'a, F>, Vec<bool>> for SvmParams<F, Pr> {
    type Object = Svm<'a, F, Pr>;

    fn fit(&self, dataset: &'a Dataset<Kernel<'a, F>, Vec<bool>>) -> Self::Object {
        ...
    }
}
```
The dataset type is `&'a Dataset<Kernel<'a, F>, Vec<bool>>`, which ensures that the kernel lives long enough during training. Fitting produces a trained state, called `Svm<'a, F, Pr>`, with probability type `Pr`.

The [Predict](src/traits.rs) trait should be implemented for dataset arguments as well as for plain arrays. If a dataset is provided, then `predict` takes ownership of it and returns a new dataset with the predicted targets. For an array, `predict` takes a reference and returns the predicted targets. In the same context, SVM implements predict like this:
```rust
impl<'a, F: Float, T: Targets> Predict<Dataset<Array2<F>, T>, Dataset<Array2<F>, Vec<Pr>>>
    for Svm<'a, F, Pr>
{
    fn predict(&self, data: Dataset<Array2<F>, T>) -> Dataset<Array2<F>, Vec<Pr>> {
        ...
    }
}
```
and
```rust
impl<'a, F: Float, D: Data<Elem = F>> Predict<ArrayBase<D, Ix2>, Vec<Pr>> for Svm<'a, F, Pr> {
    fn predict(&self, data: ArrayBase<D, Ix2>) -> Vec<Pr> {
        ...
    }
}
```

For an example of a `Transformer`, please look at the [linfa-kernel](linfa-kernel/src/lib.rs) implementation.
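
To give an idea of the overall shape of such an implementation, here is a purely illustrative sketch; it assumes the `Transformer` trait from [src/traits.rs](src/traits.rs) exposes a single `transform` method parameterized over input and output types:
```rust
use linfa::traits::Transformer;
use ndarray::Array2;

// Illustrative transformer without learned state: scales every record by a constant factor.
pub struct Scale(pub f64);

impl Transformer<Array2<f64>, Array2<f64>> for Scale {
    fn transform(&self, records: Array2<f64>) -> Array2<f64> {
        records * self.0
    }
}
```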

## Parameters and builder

An algorithm has a number of hyperparameters, describing how it operates. This section describes how the algorithm's structs should be organized in order to conform with other implementations.

Imagine we have an implementation called `MyAlg`; there should then be a separate struct called `MyAlgParams`. The method `MyAlg::params(..) -> MyAlgParams` constructs a parameter set with default values and, optionally, any required arguments (for example the number of clusters). If no arguments are required, then `std::default::Default` can be implemented as well:
```rust
impl Default for MyAlgParams {
    fn default() -> MyAlgParams {
        MyAlg::params()
    }
}
```

`MyAlgParams` should implement the consuming builder pattern, explained in the [Rust Book](https://doc.rust-lang.org/1.0.0/style/ownership/builders.html). Each hyperparameter gets a field in the struct, as well as a method to modify it. If a random number generator is used in the training process, two separate methods should be provided: one taking a seed and one taking a random number generator. From the seed, a default RNG such as [Isaac64](https://docs.rs/rand_isaac/0.2.0/rand_isaac/isaac64/index.html) is initialized.
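
A minimal sketch of such a parameter struct is shown below; the names `MyAlgParams`, `eps` and `with_rng` are illustrative only:
```rust
use rand::{Rng, SeedableRng};
use rand_isaac::Isaac64Rng;

pub struct MyAlgParams<R: Rng> {
    eps: f64,
    rng: R,
}

impl MyAlgParams<Isaac64Rng> {
    /// Construct the parameter set from a seed by initializing the default RNG.
    pub fn new(seed: u64) -> Self {
        Self {
            eps: 1e-5,
            rng: Isaac64Rng::seed_from_u64(seed),
        }
    }
}

impl<R: Rng> MyAlgParams<R> {
    /// Consuming setter: takes `self` by value and hands it back for chaining.
    pub fn eps(mut self, eps: f64) -> Self {
        self.eps = eps;
        self
    }

    /// Swap in a user-provided random number generator.
    pub fn with_rng<R2: Rng>(self, rng: R2) -> MyAlgParams<R2> {
        MyAlgParams { eps: self.eps, rng }
    }
}
```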

With a constructed set of parameters, `MyAlgParams::fit(..) -> Result<MyAlg>` executes the learning process and returns the learned state. If one of the parameters is invalid (for example, outside a required range), then an `Error::InvalidState` should be returned. For transformers there is only `MyAlg` and no `MyAlgParams`, because there is no hidden state to be learned.

Following this convention, the pattern is used like this:
```rust
MyAlg::params()
    .eps(1e-5)
    .backwards(true)
    ...
    .fit(&dataset)?;
```
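
How `fit` might guard against an invalid parameter is sketched below; the concrete error type is an assumption here and should be replaced by the error enum actually defined in the crate:
```rust
// Sketch only: reject an out-of-range tolerance before training starts.
// In Linfa this would map to a variant such as `Error::InvalidState`.
fn check_params(eps: f64) -> Result<(), String> {
    if eps <= 0.0 {
        return Err(format!("tolerance must be positive, got {}", eps));
    }
    Ok(())
}
```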

## Let the user choose their favourite linear algebra library

One important decision for a user of Linfa is the linear algebra backend. The `ndarray-linalg` library currently supports `openblas`, `netlib` and `intel-mkl` as backends. It is considered good practice to let the user choose which backend they want to use. Let's say you're using the `linfa-kernel` subcrate and have also added `ndarray-linalg` as a dependency; then your cargo manifest should add the corresponding features:

```
[features]
default = []
openblas = ["ndarray-linalg/openblas", "linfa-kernel/openblas"]
intel-mkl = ["ndarray-linalg/intel-mkl", "linfa-kernel/intel-mkl"]
netlib = ["ndarray-linalg/netlib", "linfa-kernel/netlib"]

[dependencies]
ndarray = { version = "0.13", default-features=false, features=["blas"] }
ndarray-linalg = { version = "0.12" }
linfa-kernel = { path = "../linfa-kernel" }
...

```
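
With this set-up the backend is selected at build time, for example with `cargo build --features openblas`, while the default feature set pulls in no BLAS backend at all.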

## Use a specific backend for testing

When you're implementing tests that rely on `ndarray-linalg`, you have to add the `openblas-src` crate. This instructs cargo to compile the backend so that the required symbols can be found. Your cargo manifest should include the dependency:
```
[dev-dependencies]
...
openblas-src = "0.9"
```
and you have to add an `extern crate openblas_src;` statement to the `tests` module.
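
A minimal sketch of what this looks like in a test file (the file name and test body are placeholders):
```rust
// tests/algorithm.rs
// Link the BLAS backend so that the symbols required by `ndarray-linalg` resolve.
extern crate openblas_src;

#[test]
fn smoke_test() {
    // exercise code paths that go through ndarray-linalg here
}
```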

## Generic float types

Every algorithm should be implemented for the `f32` and `f64` floating point types. This can be achieved with the `linfa::Float` trait, which is basically just a combination of `ndarray::NdFloat` and `num_traits::Float`. You can look up most of the constants (like zero, one, PI) in the `num_traits` documentation. Here is a small example of a function that is generic over `Float`:
```rust
use linfa::Float;
fn div_capped<F: Float>(num: F) -> F {
    F::one() / (num + F::from(1e-5).unwrap())
}
```
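
The same function then works with either precision, for example (inside any function body):
```rust
let x32 = div_capped(10f32); // monomorphized for f32
let x64 = div_capped(10f64); // monomorphized for f64
```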

## Make serde optional

If you want to implement `Serialize` and `Deserialize` for your parameters, please do so behind a feature flag. You can add the following to your cargo manifest:
```
[features]
serde = ["serde_crate", "ndarray/serde"]

[dependencies.serde_crate]
package = "serde"
optional = true
version = "1.0"
```
This renames the `serde` crate to `serde_crate` and adds a `serde` feature. In your parameter struct, put the derive macros behind the `serde` feature:
```rust
#[cfg(feature = "serde")]
use serde_crate::{Deserialize, Serialize};

#[cfg_attr(
    feature = "serde",
    derive(Serialize, Deserialize),
    serde(crate = "serde_crate")
)]
#[derive(Clone, Debug, PartialEq)]
pub struct HyperParams {
    ...
}
```
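
Downstream crates can then opt in by enabling the feature in their own manifest, for example `linfa-clustering = { version = "0.1", features = ["serde"] }`.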

17 changes: 1 addition & 16 deletions Cargo.toml
@@ -13,24 +13,9 @@ keywords = ["machine-learning", "linfa", "ai", "ml"]
categories = ["algorithms", "mathematics", "science"]

[dependencies]
ndarray = { version = "0.13", default-features = false }
ndarray = "0.13"
num-traits = "0.1.32"

linfa-clustering = { path = "linfa-clustering", version = "0.1" }
linfa-trees = { path = "linfa-trees", version = "0.1" }
linfa-reduction = { path = "linfa-reduction", version = "0.1" }
linfa-linear = { path = "linfa-linear", version = "0.1" }
linfa-logistic = { path = "linfa-logistic", version = "0.1" }
linfa-svm = { path = "linfa-svm", version = "0.1" }
linfa-hierarchical = { path = "linfa-hierarchical", version = "0.1" }

[dev-dependencies]
rand = "0.7"
ndarray-rand = "0.11"
rand_isaac = "0.2.0"
ndarray-npy = { version = "0.5", default-features = false }
approx = "0.3"

[workspace]
members = [
"linfa-clustering",
24 changes: 19 additions & 5 deletions linfa-clustering/Cargo.toml
@@ -12,18 +12,32 @@ readme = "README.md"
keywords = ["clustering", "machine-learning", "linfa", "k-means", "unsupervised"]
categories = ["algorithms", "mathematics", "science"]

[features]
default = []
openblas = ["ndarray-linalg/openblas"]
intel-mkl = ["ndarray-linalg/intel-mkl"]
netlib = ["ndarray-linalg/netlib"]
serde = ["serde_crate", "ndarray/serde"]

[dependencies.serde_crate]
package = "serde"
optional = true
version = "1.0"
default-features = false
features = ["std", "derive"]

[dependencies]
ndarray = { version = "0.13" , features = ["rayon", "serde", "approx"]}
ndarray = { version = "0.13" , features = ["rayon", "approx"]}
ndarray-rand = "0.11"
ndarray-stats = "0.3"
ndarray-linalg = { version = "0.12", features = ["openblas"] }
ndarray-linalg = { version = "0.12" }
sprs = "0.7"
serde = { version = "1", features = ["derive"] }
num-traits = "0.1.32"
linfa-kernel = { path = "../linfa-kernel" }
rand_isaac = "0.2.0"
linfa = { path = ".." }

[dev-dependencies]
rand_isaac = "0.2.0"
openblas-src = "0.9"
ndarray-npy = { version = "0.5", default-features = false }
criterion = "0.3"
serde_json = "1"
10 changes: 6 additions & 4 deletions linfa-clustering/examples/dbscan.rs
@@ -1,4 +1,5 @@
use linfa_clustering::{generate_blobs, Dbscan, DbscanHyperParams};
use linfa::traits::Transformer;
use linfa_clustering::{generate_blobs, Dbscan};
use ndarray::array;
use ndarray_npy::write_npy;
use ndarray_rand::rand::SeedableRng;
@@ -17,10 +18,11 @@ fn main() {

// Configure our training algorithm
let min_points = 3;
let hyperparams = DbscanHyperParams::new(min_points).tolerance(1e-5).build();

// Infer an optimal set of centroids based on the training data distribution
let cluster_memberships = Dbscan::predict(&hyperparams, &dataset);
let cluster_memberships = Dbscan::params(min_points)
.tolerance(1e-5)
.build()
.transform(&dataset);

// Save to disk our dataset (and the cluster label assigned to each observation)
// We use the `npy` format for compatibility with NumPy
28 changes: 14 additions & 14 deletions linfa-clustering/examples/kmeans.rs
@@ -1,4 +1,6 @@
use linfa_clustering::{generate_blobs, KMeans, KMeansHyperParams};
use linfa::traits::{Fit, Predict};
use linfa::Dataset;
use linfa_clustering::{generate_blobs, KMeans};
use ndarray::{array, Axis};
use ndarray_npy::write_npy;
use ndarray_rand::rand::SeedableRng;
@@ -13,27 +15,25 @@ fn main() {
// For each our expected centroids, generate `n` data points around it (a "blob")
let expected_centroids = array![[10., 10.], [1., 12.], [20., 30.], [-20., 30.],];
let n = 10000;
let dataset = generate_blobs(n, &expected_centroids, &mut rng);
let dataset = Dataset::from(generate_blobs(n, &expected_centroids, &mut rng));

// Configure our training algorithm
let n_clusters = expected_centroids.len_of(Axis(0));
let hyperparams = KMeansHyperParams::new(n_clusters)
let model = KMeans::params_with_rng(n_clusters, rng)
.max_n_iterations(200)
.tolerance(1e-5)
.build();

// Infer an optimal set of centroids based on the training data distribution
let model = KMeans::fit(hyperparams, &dataset, &mut rng);
.build()
.fit(&dataset);

// Assign each point to a cluster using the set of centroids found using `fit`
let cluster_memberships = model.predict(&dataset);
let dataset = model.predict(dataset);
let Dataset {
records, targets, ..
} = dataset;

// Save to disk our dataset (and the cluster label assigned to each observation)
// We use the `npy` format for compatibility with NumPy
write_npy("clustered_dataset.npy", dataset).expect("Failed to write .npy file");
write_npy(
"clustered_memberships.npy",
cluster_memberships.map(|&x| x as u64),
)
.expect("Failed to write .npy file");
write_npy("clustered_dataset.npy", records).expect("Failed to write .npy file");
write_npy("clustered_memberships.npy", targets.map(|&x| x as u64))
.expect("Failed to write .npy file");
}