
Explain the monotonically decreasing part #108

Open
amirhfarzaneh opened this issue Aug 23, 2018 · 4 comments

@amirhfarzaneh

Can anybody please explain why ψ should be monotonically decreasing on every interval?

@melgor

melgor commented Aug 24, 2018

First of all, I advise you to read this paper: https://arxiv.org/abs/1801.07698
It contains a summary of the ideas based on L-Softmax/A-Softmax.
Here is a very nice plot of ψ functions from several of those ideas.
[image: plot of the ψ(θ) curves for SoftMax and several margin-based variants]

Notice SoftMax in this plot. It looks like most of the functions are monotonically decreasing simply because the base SoftMax has this property. So your question really becomes: why is SoftMax monotonically decreasing on every interval?
By SoftMax here we mean the last linear layer plus the SoftMax normalization (this is crucial for what follows; a minimal sketch is below).
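
For concreteness, a minimal sketch (NumPy; the shapes and names are illustrative, not from any repo) of what "SoftMax" refers to here:

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.random.randn(128)       # feature vector from the backbone
W = np.random.randn(10, 128)   # last linear layer: one weight row per class
logits = W @ x                 # the "target logit" for class i is W[i] . x
probs = softmax(logits)        # SoftMax normalization over the class logits
```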

To make things easier to explain (and it is also better for metric learning), we L2-normalize all features before the final layer, and we also L2-normalize the weights of the final layer. Then multiplying features by weights is just the cosine similarity. What does the output of such a similarity look like? It is just the cosine function, where x is the angle and y is a value from -1 to 1.
[image: plot of cos(θ) with the angle on the x axis and values from -1 to 1 on the y axis]

In this cosine plot we are interested in the interval 0-180 degrees. Then it looks exactly like the plot in the ArcFace paper. What does the angle on the x axis mean? It is a measure of similarity between two vectors that considers only their direction (as the magnitude is normalized away). If the vectors are similar, the angle is ~0 degrees (and this is our aim in training: the feature representing a class should be very similar to the weight representing the same class); if they point in completely different directions (lie on the same line but point in opposite directions), then the angle is ~180 degrees.
So being monotonically decreasing is a natural property of cosine similarity.
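
A minimal sketch of that normalization step (NumPy; shapes are illustrative): once both sides are L2-normalized, the final linear layer outputs exactly cos(theta), which falls monotonically as the angle grows from 0 to 180 degrees.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 128))   # batch of embeddings
weights = rng.normal(size=(10, 128))   # final-layer weights, one row per class

# L2-normalize rows so every vector has unit length.
f = features / np.linalg.norm(features, axis=1, keepdims=True)
w = weights / np.linalg.norm(weights, axis=1, keepdims=True)

# The linear layer is now exactly cos(theta) for each (sample, class) pair.
cos_theta = f @ w.T
print(cos_theta.min() >= -1, cos_theta.max() <= 1)   # True True

# cos(theta) at 0, 90 and 180 degrees falls monotonically.
print(np.cos(np.deg2rad([0.0, 90.0, 180.0])))        # [1. 0. -1.] (up to floating point)
```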

This is just one way of explaining it; there are many more.
But we can still ask: what would happen if ψ were not monotonically decreasing?
From a theoretical point of view, suppose the function looked like the cosine above, but over x values from 0 to 180 degrees it swept the y values [1, 0, -1, 0, 1] (so the cosine squashed onto a 2x smaller x axis).
Our aim is to maximize the similarity for same classes and minimize it for different classes.
But this function has two maxima for the same class; which one should the network choose? It should choose the '1' at 0 degrees, because at 180 degrees the vector is completely different. Also, the minimum for different classes is at 90 degrees, so different classes would end up with some similarity between them (complete nonsense).
This means that a non-monotonically-decreasing ψ can produce local minima with very bad outputs, which could be very hard to escape. So it is better to design functions that are monotonically decreasing.
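
As a quick numerical illustration of that thought experiment (this squashed ψ is hypothetical, not any published loss):

```python
import numpy as np

theta_deg = np.linspace(0, 180, 181)
psi = np.cos(2 * np.deg2rad(theta_deg))   # sweeps [1, 0, -1, 0, 1] over 0..180

# Two maxima for the same class, and the minimum for other classes at 90 deg.
print(theta_deg[np.isclose(psi, 1.0)])    # [  0. 180.]
print(theta_deg[np.argmin(psi)])          # 90.0
```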

This is my explanation, which comes from studying this topic for a while. It is not perfect; it would take a blog post to explain the whole idea behind it.

@happynear

Great explanation. In a nutshell, the increasing parts of the curve have gradients of the opposite sign. This means that an increasing curve will push features away from the class center!

I'm sure you don't want such a property...
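
To put numbers on that: assuming training does gradient ascent on the target logit ψ(θ), θ moves in the direction of dψ/dθ, so the sign of the derivative decides whether the feature is pulled toward or pushed away from the class center.

```python
import numpy as np

theta = np.deg2rad(137.5)        # a point on the increasing part of cos(2*theta)

# d/dtheta cos(theta) = -sin(theta): negative everywhere on (0, 180) degrees.
print(-np.sin(theta))            # < 0: theta shrinks, feature pulled to the center

# d/dtheta cos(2*theta) = -2*sin(2*theta): positive on (90, 180) degrees.
print(-2 * np.sin(2 * theta))    # > 0: theta grows, feature pushed away
```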

@amirhfarzaneh
Author

@melgor Thank you for your thorough response. It clarified a lot of things. I just have this question: in the first plot, it seems that the author is only drawing cos(theta) in the range 0 to 180. But the target logit is ||W|| ||x|| cos(theta), so the logit depends not only on the cosine function but also on the product of ||W|| and ||x||.
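
(A quick NumPy check of this point, with made-up vectors: scaling the feature changes the raw logit but leaves cos(theta) unchanged.)

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=64)
W = rng.normal(size=64)

raw1, raw2 = W @ x, W @ (2 * x)   # raw logits = ||W|| ||x|| cos(theta)
cos1 = raw1 / (np.linalg.norm(W) * np.linalg.norm(x))
cos2 = raw2 / (np.linalg.norm(W) * np.linalg.norm(2 * x))

print(raw2 / raw1)             # 2.0: the raw logit scales with the feature norm
print(np.isclose(cos1, cos2))  # True: the cosine part does not
```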

@happynear

@amirhfarzaneh
Yes, this should not be the target logit. We have already discussed this issue in happynear/AMSoftmax#8 .
