Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add calculate_cng_indices #97

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
67 changes: 66 additions & 1 deletion factor_analyzer/factor_analyzer.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,12 +16,13 @@
from scipy.optimize import minimize
from scipy.stats import chi2, pearsonr
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_models import LinearRegression
from sklearn.utils import check_array
from sklearn.utils.extmath import randomized_svd
from sklearn.utils.validation import check_is_fitted

from .rotator import OBLIQUE_ROTATIONS, POSSIBLE_ROTATIONS, Rotator
from .utils import corr, impute_values, partial_correlations, smc
from .utils import corr, covariance_to_correlation, impute_values, partial_correlations, smc

POSSIBLE_SVDS = ["randomized", "lapack"]

Expand Down Expand Up @@ -112,6 +113,70 @@ def calculate_bartlett_sphericity(x):
return statistic, p_value


def calculate_cng_indices(
data: np.ndarray, model: str = "components"
) -> Tuple[int, pd.DataFrame]:
"""Calculate the Cattel-Nelson-Gorsuch indices, which are used to determine
the appropriate number of factors for a factor analysis.

Direct port of nCng function from nFactors package:
https://rdrr.io/cran/nFactors/man/nCng.html

Parameters
----------
data : array-like
The array of samples x observable for which to calculate CNG indices
model : str
"components" or "factors"

Returns
-------
num_factors : int
The number of components/factors to retain
details : pd.DataFrame
The eigenvalues and CNG indices of the dataset
"""
data = corr(data)
if model == "factors":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're just going to have "factors", do we need the model argument? Does it make sense to exclude this for now?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if model == "components", the adjustment underneath is not needed. In other words, only taking the correlation of the data is sufficient. We might want to add elif model != "components": raise ValueError

data -= np.linalg.pinv(np.diag(np.diag(np.linalg.pinv(data))))
elif model != "components":
raise ValueError(f"model must be one of 'factors' or 'components'. Received '{model}'")

values = np.sort(np.linalg.eigvals(data))[::-1]

num_variables = len(data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't len(data) give you the number of observations, rather than the number of variables, assuming data is a numpy array or pandas data frame?

Unfortunately, this doesn't translate exactly from R:

> df <- data.frame(A = c(1, 2, 3), B = c(1, 2, 3))
> length(df) 
[1] 2
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})
>>> len(df)
3
>>> len(df.values)
3

Also, should we move the line below to the beginning?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this line of the code data refers to the correlation matrix, which is square. data.shape[0] == data.shape[1]

if num_variables < 6:
raise ValueError("The number of variables must be at least 6")

fit_size = 3
cng = np.diff(
[
[
LinearRegression()
.fit(idx_values[:, np.newaxis], values[idx_values])
.coef_
for idx_values in [
np.arange(idx_fit, idx_fit + fit_size),
np.arange(idx_fit + fit_size, idx_fit + 2 * fit_size),
]
]
for idx_fit in range(num_variables - 2 * fit_size)
],
axis=1,
).squeeze(axis=(1, 2))

num_factors = np.nanargmax(cng) + fit_size

details = pd.DataFrame(
{
"data": values[: len(cng)],
"cng": cng,
}
)

return num_factors, details


class FactorAnalyzer(BaseEstimator, TransformerMixin):
"""
The main exploratory factor analysis class.
Expand Down