Factor analysis of mixed data

In statistics, factor analysis of mixed data (FAMD), or factorial analysis of mixed data, is a factorial method for data tables in which a group of individuals is described by both quantitative and qualitative variables. It belongs to the exploratory methods developed by the French school of data analysis (Analyse des données) founded by Jean-Paul Benzécri.

The term mixed refers to the simultaneous presence, as active elements, of quantitative and qualitative variables. Roughly, we can say that FAMD works as a principal components analysis (PCA) for quantitative variables and as a multiple correspondence analysis (MCA) for qualitative variables.

Scope

When the data include both types of variables but the active variables are all of the same type, PCA or MCA can be used.

Indeed, it is easy to include supplementary quantitative variables in MCA by means of the correlation coefficients between these variables and the factors on individuals (a factor on individuals is the vector gathering the coordinates of the individuals on a factorial axis); the representation obtained is a correlation circle (as in PCA).

Similarly, it is easy to include supplementary categorical variables in PCA.[1] For this, each category is represented by the center of gravity of the individuals who have it (as in MCA).

When the active variables are mixed, the usual practice is to discretize the quantitative variables (e.g. in surveys, age is usually transformed into age classes). The data thus obtained can be processed by MCA.
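As a minimal sketch of such a discretization, the snippet below maps ages to age classes. The class boundaries and labels are hypothetical and chosen only for illustration; a real survey would define its own.

```python
from bisect import bisect_right

# Hypothetical class boundaries (in years) and labels, for illustration only.
AGE_BOUNDS = [25, 45, 65]
AGE_LABELS = ["under 25", "25-44", "45-64", "65 and over"]

def age_class(age):
    """Return the label of the age class that contains `age`."""
    # bisect_right counts how many boundaries are <= age,
    # which is exactly the index of the class.
    return AGE_LABELS[bisect_right(AGE_BOUNDS, age)]

ages = [19, 25, 44, 45, 70]
classes = [age_class(a) for a in ages]
print(classes)
```

Once every quantitative variable has been recoded in this way, all active variables are qualitative and the table can be submitted to MCA.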

This practice reaches its limits:

• When there are few individuals (fewer than a hundred, to fix ideas), in which case MCA is unstable;
• When there are few qualitative variables relative to the quantitative variables (one may be reluctant to discretize twenty quantitative variables in order to take a single qualitative variable into account).

Criterion

The data include ${\displaystyle K}$ quantitative variables ${\displaystyle {k=1,\dots ,K}}$ and ${\displaystyle Q}$ qualitative variables ${\displaystyle {q=1,\dots ,Q}}$.

Let ${\displaystyle z}$ be a quantitative variable. We denote by:

• ${\displaystyle r(z,k)}$ the correlation coefficient between variables ${\displaystyle k}$ and ${\displaystyle z}$;
• ${\displaystyle \eta ^{2}(z,q)}$ the squared correlation ratio between variables ${\displaystyle z}$ and ${\displaystyle q}$.
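
For reference, the squared correlation ratio is the share of the variance of ${\displaystyle z}$ that is explained by the categories of ${\displaystyle q}$. It can be written as follows, with notation introduced here only for illustration: ${\displaystyle n_{j}}$ is the number of individuals having category ${\displaystyle j}$ of ${\displaystyle q}$, ${\displaystyle {\bar {z}}_{j}}$ is their mean on ${\displaystyle z}$, and ${\displaystyle {\bar {z}}}$ is the overall mean of ${\displaystyle z}$:

${\displaystyle \eta ^{2}(z,q)={\frac {\sum _{j}n_{j}({\bar {z}}_{j}-{\bar {z}})^{2}}{\sum _{i}(z_{i}-{\bar {z}})^{2}}}}$

Like ${\displaystyle r^{2}}$, it lies between 0 and 1. For instance, with ${\displaystyle k_{1}}$ and ${\displaystyle q_{1}}$ from Table 1 of the Example section, the category means are 1.5, 3.5 and 5.5 around an overall mean of 3.5, which gives ${\displaystyle \eta ^{2}=16/17.5\approx 0.91}$.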

In the PCA of the ${\displaystyle K}$ quantitative variables, we look for the function on ${\displaystyle I}$ (a function on ${\displaystyle I}$ assigns a value to each individual; this is the case for the initial variables and for the principal components) that is the most correlated to all ${\displaystyle K}$ variables in the following sense:

${\displaystyle \sum _{k}r^{2}(z,k)}$ maximum.

In the MCA of the ${\displaystyle Q}$ qualitative variables, we look for the function on ${\displaystyle I}$ that is the most related to all ${\displaystyle Q}$ variables in the following sense:

${\displaystyle \sum _{q}\eta ^{2}(z,q)}$ maximum.

In the FAMD of ${\displaystyle \{K,Q\}}$, we look for the function on ${\displaystyle I}$ that is the most related to all ${\displaystyle K+Q}$ variables in the following sense:

${\displaystyle \sum _{k}r^{2}(z,k)+\sum _{q}\eta ^{2}(z,q)}$ maximum.

In this criterion, both types of variables play the same role. The contribution of each variable to this criterion is bounded by 1, since ${\displaystyle r^{2}}$ and ${\displaystyle \eta ^{2}}$ both lie between 0 and 1.
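The criterion can be evaluated numerically for any candidate function on ${\displaystyle I}$. The sketch below does so on the data of Table 1 from the Example section. It then searches for a maximizing function by power iteration on a table made of standardized quantitative variables and of centered category indicators divided by the square root of the category proportion. Both the power iteration and this equivalent-PCA table are assumptions of the sketch (the latter is a commonly described formulation of FAMD), not a description of how any particular package proceeds; all function names are illustrative.

```python
from math import sqrt

# Table 1 of the Example section (six individuals).
QUANTI = {
    "k1": [2, 5, 3, 4, 1, 6],
    "k2": [4.5, 4.5, 1, 1, 1, 1],
    "k3": [4, 4, 2, 2, 1, 2],
}
QUALI = {
    "q1": ["A", "C", "B", "B", "A", "C"],
    "q2": ["B", "B", "B", "B", "A", "A"],
    "q3": ["C", "C", "B", "B", "A", "A"],
}
N = 6

def mean(v):
    return sum(v) / len(v)

def r2(z, x):
    """Squared correlation coefficient between two quantitative variables."""
    mz, mx = mean(z), mean(x)
    sxy = sum((a - mz) * (b - mx) for a, b in zip(z, x))
    szz = sum((a - mz) ** 2 for a in z)
    sxx = sum((b - mx) ** 2 for b in x)
    return sxy ** 2 / (szz * sxx)

def eta2(z, q):
    """Squared correlation ratio: between-category share of the variance of z."""
    mz = mean(z)
    total = sum((a - mz) ** 2 for a in z)
    between = 0.0
    for cat in set(q):
        group = [a for a, c in zip(z, q) if c == cat]
        between += len(group) * (mean(group) - mz) ** 2
    return between / total

def criterion(z):
    """Sum of r^2 over quantitative variables plus eta^2 over qualitative ones."""
    return (sum(r2(z, x) for x in QUANTI.values())
            + sum(eta2(z, q) for q in QUALI.values()))

def famd_columns():
    """Columns of the assumed equivalent PCA table."""
    cols = []
    for x in QUANTI.values():
        m = mean(x)
        sd = sqrt(sum((a - m) ** 2 for a in x) / N)
        cols.append([(a - m) / sd for a in x])
    for q in QUALI.values():
        for cat in sorted(set(q)):
            p = q.count(cat) / N
            cols.append([((1.0 if c == cat else 0.0) - p) / sqrt(p) for c in q])
    return cols

def first_factor(iterations=500):
    """Power iteration on W = X X' / N, started from the standardized k2."""
    cols = famd_columns()
    z = cols[1][:]  # arbitrary starting point
    for _ in range(iterations):
        wz = [0.0] * N
        for x in cols:
            dot = sum(a * b for a, b in zip(x, z)) / N
            for i in range(N):
                wz[i] += x[i] * dot
        norm = sqrt(sum(a * a for a in wz))
        z = [a / norm for a in wz]
    return z

crit_k2 = criterion(QUANTI["k2"])  # criterion for the candidate z = k2
f1 = first_factor()
crit_f1 = criterion(f1)            # criterion for the iterated function
print(round(crit_k2, 2), round(crit_f1, 2))
```

With six variables, the criterion cannot exceed 6. The value obtained for the iterated function is at least the value obtained for the starting candidate ${\displaystyle k_{2}}$.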

Plots

The representation of individuals is made directly from the factors on ${\displaystyle I}$.

The representation of quantitative variables is constructed as in PCA (correlation circle).

The representation of the categories of qualitative variables is as in MCA: a category is at the centroid of the individuals who possess it. Note that we take the exact centroid and not, as is customary in MCA, the centroid up to a coefficient that depends on the axis (in MCA this coefficient is equal to the inverse of the square root of the eigenvalue; it would be inappropriate in FAMD).
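
In symbols, and with notation introduced here only for illustration (${\displaystyle F_{s}(i)}$ is the coordinate of individual ${\displaystyle i}$ on axis ${\displaystyle s}$, ${\displaystyle I_{j}}$ is the set of the ${\displaystyle n_{j}}$ individuals possessing category ${\displaystyle j}$, and ${\displaystyle \lambda _{s}}$ is the eigenvalue of rank ${\displaystyle s}$), the coordinate of category ${\displaystyle j}$ in FAMD can be sketched as:

${\displaystyle G_{s}(j)={\frac {1}{n_{j}}}\sum _{i\in I_{j}}F_{s}(i)}$

whereas the customary MCA representation would plot ${\displaystyle {\frac {1}{\sqrt {\lambda _{s}}}}G_{s}(j)}$ instead.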

The representation of variables is called the relationship square. The coordinate of qualitative variable ${\displaystyle j}$ along axis ${\displaystyle s}$ is equal to the squared correlation ratio between the variable ${\displaystyle j}$ and the factor of rank ${\displaystyle s}$ (denoted ${\displaystyle \eta ^{2}(j,s)}$). The coordinate of quantitative variable ${\displaystyle k}$ along axis ${\displaystyle s}$ is equal to the squared correlation coefficient between the variable ${\displaystyle k}$ and the factor of rank ${\displaystyle s}$ (denoted ${\displaystyle r^{2}(k,s)}$).

Aids to interpretation

The relationship indicators between the initial variables are combined in a so-called relationship matrix that contains, at the intersection of row ${\displaystyle l}$ and column ${\displaystyle c}$:

• If the variables ${\displaystyle l}$ and ${\displaystyle c}$ are quantitative, the squared correlation coefficient between the variables ${\displaystyle l}$ and ${\displaystyle c}$;
• If one of the variables ${\displaystyle l}$ and ${\displaystyle c}$ is qualitative and the other is quantitative, the squared correlation ratio between ${\displaystyle l}$ and ${\displaystyle c}$;
• If the variables ${\displaystyle l}$ and ${\displaystyle c}$ are qualitative, the indicator ${\displaystyle \phi ^{2}}$ between the variables ${\displaystyle l}$ and ${\displaystyle c}$.
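
The three cases above can be sketched as follows on the data of Table 1 from the Example section. The function names are illustrative, and ${\displaystyle \phi ^{2}}$ is computed here as the chi-square statistic of the contingency table divided by the number of individuals, which is the usual definition assumed for this sketch.

```python
# Table 1 of the Example section: numbers are quantitative, strings qualitative.
DATA = {
    "k1": [2, 5, 3, 4, 1, 6],
    "k2": [4.5, 4.5, 1, 1, 1, 1],
    "k3": [4, 4, 2, 2, 1, 2],
    "q1": ["A", "C", "B", "B", "A", "C"],
    "q2": ["B", "B", "B", "B", "A", "A"],
    "q3": ["C", "C", "B", "B", "A", "A"],
}

def is_quanti(values):
    return not isinstance(values[0], str)

def mean(v):
    return sum(v) / len(v)

def r2(x, y):
    """Squared correlation coefficient between two quantitative variables."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def eta2(z, q):
    """Squared correlation ratio between quantitative z and qualitative q."""
    mz = mean(z)
    total = sum((a - mz) ** 2 for a in z)
    between = 0.0
    for cat in set(q):
        group = [a for a, c in zip(z, q) if c == cat]
        between += len(group) * (mean(group) - mz) ** 2
    return between / total

def phi2(a, b):
    """Chi-square statistic of the contingency table of a and b, divided by n."""
    n = len(a)
    chi2 = 0.0
    for ca in set(a):
        for cb in set(b):
            observed = sum(1 for u, v in zip(a, b) if u == ca and v == cb)
            expected = a.count(ca) * b.count(cb) / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2 / n

def relationship(l, c):
    """Entry of the relationship matrix at row l and column c."""
    x, y = DATA[l], DATA[c]
    if is_quanti(x) and is_quanti(y):
        return r2(x, y)
    if is_quanti(x):
        return eta2(x, y)
    if is_quanti(y):
        return eta2(y, x)
    return phi2(x, y)

names = list(DATA)
matrix = {l: {c: relationship(l, c) for c in names} for l in names}
for l in names:
    print(l, " ".join(f"{matrix[l][c]:.2f}" for c in names))
```

On the diagonal, ${\displaystyle \phi ^{2}}$ of a qualitative variable with itself equals its number of categories minus one, which is why the diagonal entries of qualitative variables can exceed 1.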

Example

A very small data set (Table 1) illustrates the operation and outputs of FAMD. Six individuals are described by three quantitative variables and three qualitative variables. The data were analyzed using the FAMD function of the R package FactoMineR.

Table 1. Data (test example).

| | ${\displaystyle k_{1}}$ | ${\displaystyle k_{2}}$ | ${\displaystyle k_{3}}$ | ${\displaystyle q_{1}}$ | ${\displaystyle q_{2}}$ | ${\displaystyle q_{3}}$ |
|---|---|---|---|---|---|---|
| ${\displaystyle i_{1}}$ | 2 | 4.5 | 4 | ${\displaystyle q_{1}}$-A | ${\displaystyle q_{2}}$-B | ${\displaystyle q_{3}}$-C |
| ${\displaystyle i_{2}}$ | 5 | 4.5 | 4 | ${\displaystyle q_{1}}$-C | ${\displaystyle q_{2}}$-B | ${\displaystyle q_{3}}$-C |
| ${\displaystyle i_{3}}$ | 3 | 1 | 2 | ${\displaystyle q_{1}}$-B | ${\displaystyle q_{2}}$-B | ${\displaystyle q_{3}}$-B |
| ${\displaystyle i_{4}}$ | 4 | 1 | 2 | ${\displaystyle q_{1}}$-B | ${\displaystyle q_{2}}$-B | ${\displaystyle q_{3}}$-B |
| ${\displaystyle i_{5}}$ | 1 | 1 | 1 | ${\displaystyle q_{1}}$-A | ${\displaystyle q_{2}}$-A | ${\displaystyle q_{3}}$-A |
| ${\displaystyle i_{6}}$ | 6 | 1 | 2 | ${\displaystyle q_{1}}$-C | ${\displaystyle q_{2}}$-A | ${\displaystyle q_{3}}$-A |

Table 2. Test example. Relationship matrix.

| | ${\displaystyle k_{1}}$ | ${\displaystyle k_{2}}$ | ${\displaystyle k_{3}}$ | ${\displaystyle q_{1}}$ | ${\displaystyle q_{2}}$ | ${\displaystyle q_{3}}$ |
|---|---|---|---|---|---|---|
| ${\displaystyle k_{1}}$ | 1 | 0.00 | 0.05 | 0.91 | 0.00 | 0.00 |
| ${\displaystyle k_{2}}$ | 0.00 | 1 | 0.90 | 0.25 | 0.25 | 1.00 |
| ${\displaystyle k_{3}}$ | 0.05 | 0.90 | 1 | 0.13 | 0.40 | 0.93 |
| ${\displaystyle q_{1}}$ | 0.91 | 0.25 | 0.13 | 2 | 0.25 | 1.00 |
| ${\displaystyle q_{2}}$ | 0.00 | 0.25 | 0.40 | 0.25 | 1 | 1.00 |
| ${\displaystyle q_{3}}$ | 0.00 | 1.00 | 0.93 | 1.00 | 1.00 | 2 |

In the relationship matrix, the coefficients are equal to ${\displaystyle r^{2}}$ (two quantitative variables), ${\displaystyle \phi ^{2}}$ (two qualitative variables) or ${\displaystyle \eta ^{2}}$ (one variable of each type).

The matrix shows an entanglement of the relationships between the two types of variables.

The representation of individuals (Figure 1) clearly shows three groups of individuals. The first axis opposes individuals 1 and 2 to all others. The second axis opposes individuals 3 and 4 to individuals 5 and 6.

Figure 1. FAMD. Test example. Representation of individuals.
Figure 2. FAMD. Test example. Relationship square.
Figure 3. FAMD. Test example. Correlation circle.
Figure 4. FAMD. Test example. Representation of the categories of qualitative variables.

The representation of variables (relationship square, Figure 2) shows that the first axis (${\displaystyle F1}$) is closely linked to variables ${\displaystyle k_{2}}$, ${\displaystyle k_{3}}$ and ${\displaystyle q_{3}}$. The correlation circle (Figure 3) specifies the sign of the correlations between ${\displaystyle F1}$ and ${\displaystyle k_{2}}$ and between ${\displaystyle F1}$ and ${\displaystyle k_{3}}$; the representation of the categories (Figure 4) clarifies the nature of the relationship between ${\displaystyle F1}$ and ${\displaystyle q_{3}}$. Finally, individuals 1 and 2, singled out by the first axis, are characterized by high values of ${\displaystyle k_{2}}$ and ${\displaystyle k_{3}}$ and by category C of ${\displaystyle q_{3}}$.

This example illustrates how FAMD simultaneously analyses quantitative and qualitative variables. Here it brings out a first dimension based on both types of variables.

History

The original work on FAMD is due to Brigitte Escofier[2] and Gilbert Saporta.[3] This work was resumed in 2002 by Jérôme Pagès.[4] The most complete presentation of FAMD in English is included in a book by Jérôme Pagès.[5]

Software

The method is implemented in the R package FactoMineR (function FAMD).

References

1. ^ Escofier Brigitte & Pagès Jérôme (2008). Analyses factorielles simples et multiples. Dunod. Paris. 318 p. p. 27 et seq.
2. ^ Escofier Brigitte (1979). Traitement simultané de variables quantitatives et qualitatives en analyse factorielle. Les cahiers de l’analyse des données, 4, 2, 137–146. http://archive.numdam.org/ARCHIVE/CAD/CAD_1979__4_2/CAD_1979__4_2_137_0/CAD_1979__4_2_137_0.pdf
3. ^ Saporta Gilbert (1990). Simultaneous analysis of qualitative and quantitative data. Atti della XXXV riunione scientifica; società italiana di Statistica, 63–72. http://cedric.cnam.fr/~saporta/SAQQD.pdf
4. ^ Pagès Jérôme (2004). Analyse factorielle de données mixtes. Revue de Statistique appliquée, 52, 4, 93–111. http://archive.numdam.org/ARCHIVE/RSA/RSA_2004__52_4/RSA_2004__52_4_93_0/RSA_2004__52_4_93_0.pdf
5. ^ Pagès Jérôme (2014). Multiple Factor Analysis by Example Using R. Chapman & Hall/CRC The R Series. London. 272 p.