get_model_plus_scores is a convenience function that trains multiple Ridge regressions in a cross-validation scheme and evaluates their performance on the respective test sets.
import numpy as np

# simulate a stimulus representation with 5 features and fMRI data with 10 voxels
stimulus = np.random.randn(1000, 5)
fmri = np.random.randn(1000, 10)
Using the default Ridge regression
We can now use get_model_plus_scores to estimate multiple RidgeCV regressions: one for each voxel (mapping the stimulus representation to that voxel) and one for each split (trained on a different training set and evaluated on the held-out set). Since sklearn's RidgeCV estimator supports multi-output regression, we get a single RidgeCV object per split that covers all voxels.
ridges, scores = get_model_plus_scores(stimulus, fmri, cv=3)
assert len(ridges) == 3
ridges
Each RidgeCV estimator maps from the feature space to every voxel. In our example, that means it holds 10 independently trained regression models (the number of voxels), each with 5 coefficients (the number of features).
assert ridges[0].coef_.shape == (10, 5)
print(ridges[0].coef_)
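Each row of coef_ corresponds to one voxel (in the column order of fmri), so the learned weights for a single voxel are a single row:
# 5 feature weights for the first voxel in the first split
assert ridges[0].coef_[0].shape == (5,)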
We also get a set of scores (by default the product-moment correlation, but you can supply your own metric via the scorer argument) that quantify how well we predict the held-out data (with the usual caveats of using a correlation coefficient for evaluation). In our case the scores are of shape (10, 3) because we predict 10 voxels and split 3 times, i.e. we use 3-fold cross-validation.
assert scores.shape == (10, 3)
scores
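For example, here is a sketch of a custom scorer; it assumes the scorer argument accepts a callable of the form scorer(y_true, y_pred) that returns one value per voxel (check the signature of get_model_plus_scores for the exact convention):
from sklearn.metrics import r2_score

# hypothetical scorer: one R^2 value per voxel instead of the default correlation
def r2_per_voxel(y_true, y_pred):
    return r2_score(y_true, y_pred, multioutput='raw_values')

ridges, r2_scores = get_model_plus_scores(stimulus, fmri, cv=3,
                                          scorer=r2_per_voxel)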
We can also change the parameters of the RidgeCV estimator. For example, we can use pre-specified hyperparameters, such as the values of the regularization parameter $\alpha$ to grid-search over, or whether to normalize the features. To use other parameters for the default RidgeCV, we just pass them as additional keyword arguments:
alphas = [100]
ridges, scores = get_model_plus_scores(stimulus, fmri, alphas=alphas,
normalize=True, alpha_per_target=True)
assert ridges[0].normalize
assert ridges[0].alphas.shape == (1,)
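In practice one often searches over a logarithmically spaced grid of $\alpha$ values, for instance:
alphas = np.logspace(-1, 4, 6)  # 0.1 to 10000 in powers of 10
ridges, scores = get_model_plus_scores(stimulus, fmri, alphas=alphas, cv=3)
assert ridges[0].alphas.shape == (6,)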
Using your own estimator
Additionally, we can use any other estimator that implements fit and predict. For example, we can use CCA as an encoding model.
from sklearn import cross_decomposition
our_estimator = cross_decomposition.CCA(n_components=2)
ccas, scores = get_model_plus_scores(stimulus, fmri, our_estimator,
cv=3)
assert isinstance(ccas[0], cross_decomposition.CCA)
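The fitted CCA estimators behave like any other encoding model; for instance, we can predict the fMRI data from the stimulus:
# predictions cover all 10 voxels, even though CCA uses only 2 latent components
assert ccas[0].predict(stimulus).shape == (1000, 10)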
If your favorite estimator does not work in the multi-output regime, i.e. it cannot predict multiple targets/voxels at once, then get_model_plus_scores will wrap it in sklearn's MultiOutputRegressor by default, which fits one separate estimator per voxel. However, for many voxels this can increase training time considerably.
from sklearn.linear_model import Lasso
from sklearn.multioutput import MultiOutputRegressor
our_estimator = MultiOutputRegressor(Lasso())
lassos, scores = get_model_plus_scores(stimulus, fmri, our_estimator,
cv=3)
lassos
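Since MultiOutputRegressor fits one separate estimator per voxel, one way to speed this up is its own n_jobs parameter, which fits the per-voxel estimators in parallel:
# fit the per-voxel Lasso models in parallel on all available cores
our_estimator = MultiOutputRegressor(Lasso(), n_jobs=-1)
lassos, scores = get_model_plus_scores(stimulus, fmri, our_estimator, cv=3)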
Training without validation
We can also train an estimator without any validation, if, for example, we want to test on a different dataset. In that case, the scores are computed with the trained estimator on the training set, i.e. they contain no information about the generalization performance of the estimator.
from sklearn.linear_model import RidgeCV

our_estimator = RidgeCV()
model, scores = get_model_plus_scores(stimulus, fmri, our_estimator,
                                      validate=False)
assert type(model) == RidgeCV
assert scores.shape == (10,)
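A minimal sketch of evaluating such a model on a separate dataset (the test data are simulated here; the per-voxel correlation mirrors the default scorer):
test_stimulus = np.random.randn(200, 5)
test_fmri = np.random.randn(200, 10)
predictions = model.predict(test_stimulus)
# product-moment correlation between prediction and measurement, per voxel
test_scores = np.array([np.corrcoef(predictions[:, i], test_fmri[:, i])[0, 1]
                        for i in range(test_fmri.shape[1])])
assert test_scores.shape == (10,)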
Using your own cross-validation method
Instead of the default KFold cross-validation without shuffling, we can also use any scikit-learn compatible cross-validation iterator (e.g. the splitters in sklearn.model_selection). For example, we could use a TimeSeriesSplit to test our predictions on only the most recent parts of the data.
from sklearn.model_selection import TimeSeriesSplit
ts_cv = TimeSeriesSplit(n_splits=5)
model, scores = get_model_plus_scores(stimulus, fmri, cv=ts_cv)
assert scores.shape == (10, 5)
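Any other splitter works the same way, for example a shuffled KFold (shuffling is usually inappropriate for autocorrelated fMRI time series and is shown here only for illustration):
from sklearn.model_selection import KFold

shuffled_cv = KFold(n_splits=4, shuffle=True, random_state=0)
model, scores = get_model_plus_scores(stimulus, fmri, cv=shuffled_cv)
assert scores.shape == (10, 4)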
Distributed training
Voxel-wise encoding models can take a long time and a lot of memory to train, especially if we use the full brain or high resolution fMRI data.
The BlockMultiOutput class can help distribute the load across multiple cores by splitting the fMRI data into multiple "blocks" (the n_blocks parameter) and training a separate estimator for each block. Even without parallelization, this class allows you to train voxel-wise encoding models when a single, large estimator would take up too much memory, because the estimator is trained for each block of the data independently. This works even if the original fMRI data do not fit into memory, by using a memory-mapped NumPy array.
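For example, assuming the fMRI data live in a .npy file (the file name here is illustrative), NumPy can memory-map them so that the full array is never loaded into RAM at once:
np.save('fmri_example.npy', fmri)  # write an illustrative file to disk
fmri_mmap = np.load('fmri_example.npy', mmap_mode='r')  # data stay on disk until accessed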
our_estimator = BlockMultiOutput(RidgeCV(alphas=[10,100]))
estimators, scores = get_model_plus_scores(stimulus, fmri, our_estimator,
cv=3)
assert len(estimators) == 3
assert len(estimators[0].estimator.alphas) == 2
Each BlockMultiOutput estimator contains n_blocks estimators that are trained on different blocks of the target.
assert len(estimators[0].estimators_) == estimators[0].n_blocks
estimators[0].estimators_
If fmri is of shape $(n\_samples, n\_targets)$, each of the n_blocks estimators in BlockMultiOutput.estimators_ will contain the coefficients for $n\_targets / n\_blocks$ targets.
assert estimators[0].estimators_[0].coef_.shape == (1, 5)
print('fmri shape: {} \nn_blocks: {} \n'
'coefficients of the estimator for one block: {}'.format(
fmri.shape, our_estimator.n_blocks, estimators[0].estimators_[0].coef_.shape))
We can use a BlockMultiOutput instance like any other estimator to predict data, i.e. it produces predictions for the full fMRI data by concatenating the predictions of every block estimator.
assert estimators[0].predict(stimulus).shape == (1000, 10)
To parallelize training across blocks, we can set the n_jobs parameter of BlockMultiOutput:
our_estimator = BlockMultiOutput(RidgeCV(alphas=[10, 100]), n_jobs=10)
estimators, scores = get_model_plus_scores(stimulus, fmri, our_estimator,
cv=3)
assert len(estimators) == 3
assert estimators[0].n_jobs == 10