We've come up with a training recipe for 2:4 activation sparsity, which is outlined in this paper: https://openreview.net/pdf?id=O5feVk7p6Y

The gist of this approach is that:

1) We find high levels of activation sparsity (>85%) when training Squared-ReLU based FFNs instead of SwiGLU FFNs. These Squared-ReLU based FFNs show minimal to no accuracy loss.
2) We accelerate the sparse activation x dense weight matmul with 2:4 sparsity. We can naively sparsify for the forwards pass, dropping values that do not fit the 2:4 constraint (a rough sketch of this follows below). For the backwards pass, we need some special sauce to maintain accuracy.
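For concreteness, here is a minimal PyTorch sketch of what "naively sparsify" could look like: a Squared-ReLU activation followed by keeping the 2 largest-magnitude values in every contiguous group of 4. The function names are made up for illustration and this is not the exact recipe from the paper; the toy random input also won't show the >85% sparsity that trained models do.

```python
import torch

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    # Squared-ReLU activation: relu(x) ** 2
    return torch.relu(x) ** 2

def naive_24_sparsify(x: torch.Tensor) -> torch.Tensor:
    # Within every contiguous group of 4 elements, keep the 2 largest-magnitude
    # values and zero out the other 2, so the result fits the 2:4 pattern.
    orig_shape = x.shape
    groups = x.reshape(-1, 4)                       # last dim must be divisible by 4
    keep = groups.abs().topk(2, dim=-1).indices     # indices of the 2 survivors per group
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(orig_shape)

if __name__ == "__main__":
    torch.manual_seed(0)
    act = squared_relu(torch.randn(8, 1024))        # stand-in for an FFN hidden activation
    sparse_act = naive_24_sparsify(act)
    print(f"natural sparsity: {(act == 0).float().mean().item():.2%}")
    print(f"2:4 sparsity:     {(sparse_act == 0).float().mean().item():.2%}")
```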
However, @janeyx99 pointed out to me that rather than accelerating the model with 2:4 sparsity, we can exploit (1) for activation compression instead. The idea here is that we could use something like nvcomp to compress the sparse Squared-ReLU activations.

We should run some tests to see what compression ratio, and thus what memory savings, we could achieve, as well as whether the compression adds overhead we need to account for.
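As a very rough first pass before wiring up nvcomp, one could estimate an achievable ratio with a simple bitmask + packed-nonzeros layout. This is only a crude proxy (the helper below is hypothetical and says nothing about nvcomp's actual behavior or API); any real test should measure nvcomp itself.

```python
import torch

def bitmask_compression_ratio(act: torch.Tensor) -> float:
    # Dense size vs. (1 bit per element for a zero/nonzero mask) + packed nonzero values.
    nnz = act.count_nonzero().item()
    elem_bytes = act.element_size()
    dense_bytes = act.numel() * elem_bytes
    compressed_bytes = act.numel() / 8 + nnz * elem_bytes
    return dense_bytes / compressed_bytes

if __name__ == "__main__":
    torch.manual_seed(0)
    # Stand-in for a Squared-ReLU activation with roughly 90% zeros, stored in bf16.
    act = torch.relu(torch.randn(4096, 4096) - 1.3).to(torch.bfloat16) ** 2
    print(f"sparsity: {(act == 0).float().mean().item():.2%}")
    print(f"estimated compression ratio: {bitmask_compression_ratio(act):.2f}x")
```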
Hi @jcaip, this seems like an interesting take on activation sparsity. If the model activations are highly sparse (>85% and beyond), won't restricting them to 50% sparsity create a hard upper bound? I think an unstructured sparse kernel makes more sense in such scenarios, and it also makes a case for CPU inference.
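For reference, the back-of-the-envelope arithmetic behind this concern, assuming an idealized kernel whose cost scales with the fraction of nonzeros (real kernels won't hit these bounds):

```python
# Idealized compute-skipping upper bounds (no kernel actually reaches these):
structured_24 = 1 / (1 - 0.50)    # 2:4 zeroes exactly 50% of activation elements
unstructured  = 1 / (1 - 0.85)    # perfect unstructured kernel at 85% sparsity
print(f"2:4 structured upper bound:     {structured_24:.1f}x")
print(f"unstructured (85%) upper bound: {unstructured:.1f}x")
```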