`{domir}`

’s `domin`

The focus of the `{domir}`

package is on the dominance analysis (DA) methodology for determining the relative importance of independent variables (IVs)/predictors/features in a predictive model. DA determines the relative importance of independent variables in an predictive model based on each IV’s contribution to an overall model fit statistic. More specifically, DA produces several results that decompose and compare the contribution each IV makes to the fit statistic to one another. Conceptually, DA is an extension of Shapley value decomposition from Cooperative Game Theory (see Grömping, 2007 for a discussion) which seeks to find solutions of how to subdivide “payoffs” (i.e., the fit statistic) to “players” (i.e., the IVs) based on their relative contribution to the payoff; the payoff is usually assumed to be a single value (i.e., scalar valued).

The implementation DA uses to estimate the Shapley value decomposition works by using a pre-selected predictive model and applying a very extensive experimental design to it. Effectively, DA asks treats the predictive model as data and the IVs as factors in an experiment with a within-subjects full-factorial design (i.e., all possible combinations of the IVs are applied to the data). Another way of describing the process is as a brute force method where all combinations of the IVs included or excluded is estimated from the data and the fit statistics associated with each sub-model is collected.

Assuming there are \(p\) IVs in the pre-selected model, obtaining all possible combinations of the IVs results in \(2^p\) models and fit statistics to be estimated. Again, each combination of \(p\) variables alternating between included versus excluded (see Budescu, 1993).

As is mentioned above, an implication of the DA method is that the predictive model used in it is “pre-selected”. Hence, the model has been passed through model selection procedures and the user is confident that the model used in the dominance analysis is a good reflection of the “payoff” structure. DA, at least the implementation of DA that acts like a full-factorial experiment, is not effective for model selection of IVs/finding IVs with trivial effects and removing them.

“Relative importance” as a concept is used in many different ways in statistics and data science and many of them refer to methods that are focused on, or are probably best used for, model selection/identifying trivial IVs for removal. DA is a method that is more focused on importance in a “model evaluation” sense. I see this as being an application where the user describes/interprets IVs’ effects in the context of a finalized, predictive approach.

`domin`

The `domin`

function of the `{domir}`

package implements the DA, full-factorial methodology with a flexible R function application programming interface/API. The API is based on the R formula which is used in many predictive R modeling functions in base and recommended packages such as `lm`

, `glm`

, `arima`

, `polr`

, `nnet`

, `lme`

, and `gam`

. Functions that do not use formulas, or use variants of the standard R formula (i.e., those in the `Formula`

package), can be accommodated using wrapper functions.

I see the DA methodology is a general approach to model evaluation using relative importance and it is this perspective that has guided the development of this package. In my view, if you have a pre-selected predictive model and a fit statistic that can be applied to it, the `domin`

can help you evaluate/interpret the contribution of the IVs in that model.

This section introduces some of the concepts in DA using a simple predictive model and discusses how each is displayed in the `domin`

implementation when `print`

-ed.

The focus of this section is on the concepts underlying DA and less on the implementation of DA, and how to work with, the `domin`

function.

The example below uses the *mtcars* data in the `{datasets}`

package. Assume I have pre-selected, through whatever means, the following linear model:

`lm(mpg ~ am + cyl + carb, data = mtcars)`

Applying a DA to this model using `domin`

, assuming the fit statistic of interest is the explained variance \(R^2\), would look like:

```
library(domir)
library(datasets)
domin(mpg ~ am + cyl + carb,
lm, list(summary, "r.squared"),
data = mtcars)
#> Overall Fit Statistic: 0.8113023
#>
#> General Dominance Statistics:
#> General Dominance Standardized Ranks
#> am 0.2156848 0.2658501 2
#> cyl 0.4173094 0.5143698 1
#> carb 0.1783081 0.2197801 3
#>
#> Conditional Dominance Statistics:
#> IVs: 1 IVs: 2 IVs: 3
#> am 0.3597989 0.2164938 0.07076149
#> cyl 0.7261800 0.4181185 0.10762967
#> carb 0.3035184 0.1791172 0.05228872
#>
#> Complete Dominance Designations:
#> Dmnated?am Dmnated?cyl Dmnated?carb
#> Dmnates?am NA FALSE TRUE
#> Dmnates?cyl TRUE NA TRUE
#> Dmnates?carb FALSE FALSE NA
```

`domin`

`print`

s results in four sections including fit statistics overall, as well as general and conditional dominance statistics, and complete dominance designations.

`#> Overall Fit Statistic: 0.8113023 `

This results section reports on the value of the fit statistic of the pre-selected model. Other fit statistic value adjustments are reported in this section as well.

The \(R^2\) for this model is ~\(.8113\) meaning the model explains around 80% of the variance in *mpg*; this is the limiting value for how much each IV will be able to explain. The three IVs in this model will be ascribed separate components of this ~\(.8113\) value.

```
#> General Dominance Statistics:
#> General Dominance Standardized Ranks
#> am 0.2156848 0.2658501 2
#> cyl 0.4173094 0.5143698 1
#> carb 0.1783081 0.2197801 3
```

This section reports on how the fit statistic value from the last section is divided up among the IVs. The statistics produced when dividing up the overall fit statistic associated with the pre-selected model are known as *general dominance statistics*. Because they sum to the overall fit statistic, general dominance statistics are the results from the DA that are considered easiest to interpret.

General dominance statistics are computed as a weighted average of the marginal/incremental contribution to prediction (as measured by the focal fit metric, in this case the \(R^2\)) the IV makes across all sub-models in which it is included. For example, *am* has a value of ~\(0.2157\) which means, on average, *am* results in an increment to the \(R^2\) of about twenty-two percentage points when it is included in the model versus when it is not.

The *Standardized* column of statistics expresses the general dominance statistic value as a percentage of the overall fit statistic value and thus sum to 100%. *am*’s contributions to prediction comprise ~27% of the predictive usefulness of the pre-selected model (i.e., \(\frac{.2157}{.8113} = .2659\)).

Finally, the general dominance statistics can be compared to one another to determine which variables generally dominate others. Here, *cyl*’s larger general dominance statistic than *am* indicates that it generally dominates, and is thus more important than, *am*. These *general dominance designations* are reflected in the *Ranks* column.

The general dominance statistics, whereas simple to interpret, are the least stringent of the dominance statistics/designations as they average across the most sub-models (more specifics provided below). Though least stringent, the general dominance statistics tent to always allow for a comparison between two IVs. The only time a general dominance designation cannot be made between two IVs is when they are exactly equal to one another in terms of predicting the dependent variable (DV)/outcome/response.

General dominance statistics for an IV are also the arithmetic average of that IV’s conditional dominance statistics - these statistics are discussed next.

```
#> Conditional Dominance Statistics:
#> IVs: 1 IVs: 2 IVs: 3
#> am 0.3597989 0.2164938 0.07076149
#> cyl 0.7261800 0.4181185 0.10762967
#> carb 0.3035184 0.1791172 0.05228872
```

This section reports on how the predictive usefulness for any given IV changes as more IVs are added to the model. These statistics describing change as more IVs are added to the model are known as *conditional dominance statistics*. Conditional dominance statistics are more challenging to interpret as they are best evaluated across their trajectory within an IV but across different numbers of IVs in the model. They also do not cleanly decompose the \(R^2\) value like the general dominance statistics but rather shows the variation in \(R^2\) values as more IVs are added to the model.

Conditional dominance statistics are computed as the average incremental contribution to prediction an IV makes within a single order of models–where order refers to a distinct number of IVs in the prediction model. As such, each IV will have \(p\) different conditional dominance statistics.

In the example above, order one is `IVs: 1`

and refers to the incremental contribution an IV makes to predicting the DV when by itself. Order two is `IVs: 2`

and refers to the average incremental contribution an IV makes in predicting the DV when another IV is included in the model. Finally, order three is `IVs: 3`

and refers to the incremental contribution an IV makes beyond the other two IVs.

Conditional dominance statistics provide more information about each IV than general dominance statistics as they reveal the effect that IV redundancy has on prediction for each IV. In the above conditional dominance statistic matrix, *am* is slightly less affected by variable redundancy overall as its average incremental contribution to fit shrinks relatively less than the other two IVs across all three orders.

As with general dominance statistics, conditional dominance statistics can be compared to determine which conditionally dominate others. Conditional dominance is determined for an IV over another when its within-order conditional dominance statistic comparisons are larger than another IV’s statistics for all `p`

orders. If, at any order, the within-order comparison between conditional dominance statistics for two IVs are equal or there is a change rank order between IVs, no conditional dominance designation can be made between those IVs. As applied to the results above, *am* is conditionally dominated by *cyl* as its conditional dominance statistics are smaller than *cyl*’s at order 1, 2, and 3.

The increase in complexity with conditional dominance from general also results in a more stringent set of comparisons. It is more difficult for one IV to conditionally dominate than it is for one IV to generally dominate another IV as there are more ways in which conditional dominance can fail to be achieved. Additionally, conditional dominance implies general dominance–but the reverse is not true. An IV can generally, but not conditionally, dominate another IV.

The next section discusses the most stringent of the dominance designations.

```
#> Complete Dominance Designations:
#> Dmnated?am Dmnated?cyl Dmnated?carb
#> Dmnates?am NA FALSE TRUE
#> Dmnates?cyl TRUE NA TRUE
#> Dmnates?carb FALSE FALSE NA
```

This section reports on how the predictive usefulness for an IV compares with another IV across all models that are comparable for the two IVs. The comparisons described in this section do not form statistics but only dominance designations; these are known as *complete dominance designations*. Conditional dominance designations are not difficult to interpret but are the most complex set of comparisons and only exist as a set of comparisons across IV pairs.

Complete dominance designations are made by comparing all possible incremental contributions to predicting the DV between two IVs for models that are comparable–which include all sub-models that do not include the other IV. For example, the comparison between *am* and *cyl* would include comparing the model with *am* alone to the model with *cyl* alone as well as the model with *am* and *carb* beyond the model with *carb* to the model with *cyl* and *carb* beyond the model with *carb*.

Complete dominance is the most stringent dominance criterion to satisfy as it requires that an IV *always* have a larger increment to to prediction across all, \(2^{p-2}\) comparable individual models between two IVs.

As is noted above, the complete dominance designation has no statistic. The matrix `domin`

returns is a set of logicals that reads from the left to right. A value of `TRUE`

means that the IV in the row completely dominates the IV in the column. Conversely, a value of `FALSE`

means the opposite, that the IV in the row is completely dominated by the IV in the column. A `NA`

value means no complete dominance designation could be made as the comparison IVs’ incremental contributions differ in relative magnitude from model to model.

In the example above, the complete dominance designations show a clear hierarchy among the IVs with `cyl`

completely dominating `am`

which completely dominates `carb`

. Complete dominance implies both general and conditional dominance, but, again, the reverse is not true. An IV can conditionally or generally, but not completely, dominate another IV.

In this section, I expand more on what goes on during a DA and how statistics are computed using an example.

In the example below, I create a custom function that exports model results to an external R object (i.e., the `results`

data frame).

The programming that is involved in the function below is a little complex (how it works and how to develop something like it will be the subject of another vignette!) but the idea is that it captures the formula `domin`

feeds into the `lm`

model and collects the \(R^2\) value the model produces.

```
<-
lm_capture function(x, ...) {
<<- count + 1
count <- lm(x, ...)
y "fmla"] <<- deparse(x)
results[count, "r2"] <<- summary(y)[["r.squared"]]
results[count, return(y)
}
<- 0
count
<- data.frame(fmla = rep("", times = 2^3-1),
results r2 = rep(NA, times = 2^3-1) )
<- domin(mpg ~ am + cyl + carb, lm_capture, list(summary, "r.squared"), data = mtcars)
lm_da
results#> fmla r2
#> 1 mpg ~ am 0.3597989
#> 2 mpg ~ cyl 0.7261800
#> 3 mpg ~ carb 0.3035184
#> 4 mpg ~ am + cyl 0.7590135
#> 5 mpg ~ am + carb 0.7036726
#> 6 mpg ~ cyl + carb 0.7405408
#> 7 mpg ~ am + cyl + carb 0.8113023
```

As the printed result from the *results* data frame shows, `domin`

runs 7 linear regressions with all the different combinations of the IVs and collects the \(R^2\) value associated with each model.

Given the results in the *results* data frame, I will explain first how to determine complete dominance of two IVs–in this case, between *am* and *carb*. Complete dominance (\(D\)) is computed as:

\(X_vDX_z\; if\;2^{p-2}\, =\, \Sigma^{2^{p-2}}_{j=1}{ \{\begin{matrix} if\, R^2_{X_vS2_j}\, - R^2_{S2_j} > R^2_{X_zS2_j}\, - R^2_{S2_j}\,then\, 1\, \\ if\, R^2_{X_vS2_j}\, - R^2_{S2_j} \le R^2_{X_zS2_j}- R^2_{S2_j}\,then\, \,0\end{matrix} }\)

Where \(X_v\) and \(X_z\) are two IVs, \(S2_j\) is a set of the other IVs not including \(X_v\) and \(X_z\) which can include the null set with no other IVs, and \(R^2\) is a model fit statistic (does not have to be an \(R^2\) metric).

When comparing *am* and *carb*, there are 2 comparisons for each IV pair of the 7 model results to be made. Those comparisons are that between *mpg ~ am* and *mpg ~ carb* as well as the difference between *mpg ~ am + cyl* and *mpg ~ cyl + carb* from *mpg ~ cyl*. These comparisons are all the possible submodels in which models containing *am* and *carb* can be directly compared.

The comparisons suggest the complete dominance of *am* as *mpg ~ am*’s *0.3597989* is greater than *mpg ~ carb*’s *0.3035184* and *mpg ~ am + cyl*’s *0.0328335* increment beyond *mpg ~ cyl* is greater than *mpg ~ cyl + carb*’s *0.0143608* increment beyond *mpg ~ cyl*.

This result is reported in row 3-column 1 and row 1-column 3 of the *Complete_Dominance* matrix in the object *domin* returns.

```
$Complete_Dominance
lm_da#> Dmnated_am Dmnated_cyl Dmnated_carb
#> Dmnates_am NA FALSE TRUE
#> Dmnates_cyl TRUE NA TRUE
#> Dmnates_carb FALSE FALSE NA
```

The process used to compare models containing *am* and *carb* are repeated for the comparison between *am* and *cyl* as well as *carb* and *cyl*.

Computing conditional dominance statistics is the next process to be described in this vignette. Conditional dominance statistics are computed as:

\(C^i_{X_v} = \Sigma^{(\begin{matrix}p-1\\i-1\end{matrix})}_{i=1}{\frac{R^2_{X_vS_i}\, - R^2_{S_i}}{(\begin{matrix}p-1\\i-1\end{matrix})}}\)

Where \(S_i\) is a subset of IVs not including \(X_v\) and \((\begin{matrix}p-1\\i-1\end{matrix})\) is the number of distinct combinations produced choosing the number of elements in the bottom value (\(i-1\)) given the number of elements in the top value (\(p-1\))–which is the value that would be produced by `choose(p-1, i-1)`

.

The user can think of conditional dominance statistics as a window or grouped statistic in which a moving average of statistics from the *results* data frame are combined. Specifically, at the specific order in question (i.e., number of IVs in the model), the conditional dominance statistic for that order will be the sum of the models at the order in question with the IV included less the sum of the models at the previous order with the IV omitted. This sum is then divided by the number of models at the order in question.

Put another way, it is the arithmetic average of the increment to the fit statistic for the models including the IV at the order in question.

As applied to the example above involving *am*, the three conditional dominance statistics for *am* are computed as just the value of *am* by itself in the model (i.e., 0.3597989) for order 1, the average increment of *am* beyond the other two IVs (i.e., 0.7590135 - 0.72618 and 0.7036726 - 0.3035184) for order 2, and the increment *am* provides to the full model (i.e., 0.8113023 - 0.7405408).

This result is reported in row 1 of the *Conditional_Dominance* matrix in the object *domin* returns.

```
$Conditional_Dominance
lm_da#> IVs_1 IVs_2 IVs_3
#> am 0.3597989 0.2164938 0.07076149
#> cyl 0.7261800 0.4181185 0.10762967
#> carb 0.3035184 0.1791172 0.05228872
```

Because *am* completely dominates *carb* and is completely dominated by *cyl*, it also conditionally dominates *carb* and is conditionally dominated by *cyl*.

General dominance is computed as:

\(C_{X_v} = \Sigma^p_{i=1}{\frac{C^i_{X_v}}{p}}\)

Thus, the general dominance statistics are the arithmetic average of all the conditional dominance statistics for an IV.

Another way of thinking about the general dominance statistic is as a weighted average of all the fit statistics in the model. This is clearer when substituting the equation for the conditional dominance statistics into the equation for general dominance as each model is weighted both by the number of submodels at the order in which it was estimated (the average taken for conditional dominance statistic computation) as well as the number of IVs in the model total (the average taken for general dominance statistic computation).

The general dominance statistic for *am* is then computed as the average of 0.3597989, 0.2164938, and 0.0707615.

This result is reported in element 1 of the *General_Dominance* vector in the object *domin* returns.

```
$General_Dominance
lm_da#> am cyl carb
#> 0.2156848 0.4173094 0.1783081
```

Because *am* conditionally and completely dominates *carb* as well as is conditionally and completely dominated by *cyl*, it also generally dominates *carb* and is generally dominated by *cyl*.