# Modal

MOdels for Data Analysis and Learning

mixtcomp

# Usage

The demonstrator is located at https://modal-research.lille.inria.fr/BigStat. To work on your own data, you first need to create an account. Its use is straightforward: you zip three files, upload the archive, wait for the processing, then get back a zip file containing a .RData file that is readable by R. Note that this platform is currently in beta.

The three files to be uploaded in a zip file depend on the mode:

| learn | predict |
|---|---|
| data.csv | data.csv |
| descriptor.csv | descriptor.csv |
| param.ini | output.RData |
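
As an illustration of the packaging step, here is a minimal Python sketch that checks the three files required for a mode are present and zips them. The function name `build_archive`, the `REQUIRED_FILES` mapping, and the choice to store the files at the root of the archive are assumptions made for this example; they are not part of the demonstrator.

```python
import zipfile
from pathlib import Path

# File names expected in the uploaded archive for each mode (from the table above).
REQUIRED_FILES = {
    "learn": ["data.csv", "descriptor.csv", "param.ini"],
    "predict": ["data.csv", "descriptor.csv", "output.RData"],
}


def build_archive(source_dir, mode, archive_path):
    """Zip the three files expected for `mode`, failing early if one is missing."""
    source_dir = Path(source_dir)
    names = REQUIRED_FILES[mode]
    missing = [name for name in names if not (source_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing files for {mode} mode: {missing}")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            # Assumption: files are stored at the root of the archive.
            archive.write(source_dir / name, arcname=name)
    return archive_path
```

For instance, `build_archive("my_study", "learn", "learn.zip")` would produce an archive ready to upload, where `my_study` is a hypothetical directory holding the three input files.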

## Modes

The demonstrator has two modes of operation for MixtComp, “learn” and “predict”.

• In learning, the parameters of the mixtures are estimated.
• In prediction, only the missing values (including the latent class) are estimated, using the parameters estimated during a previous learning run.

## Input format

This section describes the syntax of the input files. Example test files are given in the "Test case files" section.

### descriptor.csv (learn and prediction)

The descriptor file contains the names of the variables on the first line, and the names of the models to be applied on the second line. Currently three models can be applied:

• Categorical_pjk
• Gaussian_sjk
• Poisson_k

If no information on the latent class is provided, the code runs in unsupervised mode. However, semi-supervised or fully supervised computations can be carried out by providing a z_class variable. In that case, its model must be LatentClass.

descriptor.csv
categorical1;categorical2;categorical3;gaussian1;gaussian2;gaussian3;poisson1;poisson2;poisson3;z_class
Categorical_pjk;Categorical_pjk;Categorical_pjk;Gaussian_sjk;Gaussian_sjk;Gaussian_sjk;Poisson_k;Poisson_k;Poisson_k;LatentClass
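
The two-line layout above can be generated from a mapping of variable names to model names. The following Python sketch is one possible way to do it; the function `descriptor_text` and the `KNOWN_MODELS` set are names chosen for this example, and the trailing newline at the end of the file is an assumption.

```python
# Model names listed in this section, plus LatentClass for the z_class variable.
KNOWN_MODELS = {"Categorical_pjk", "Gaussian_sjk", "Poisson_k", "LatentClass"}


def descriptor_text(models):
    """Return the content of descriptor.csv for an ordered {variable: model} mapping."""
    for name, model in models.items():
        if model not in KNOWN_MODELS:
            raise ValueError(f"unknown model {model!r} for variable {name!r}")
        if name == "z_class" and model != "LatentClass":
            raise ValueError("z_class must use the LatentClass model")
    names = list(models)  # dicts keep insertion order, so columns stay aligned
    first_line = ";".join(names)
    second_line = ";".join(models[name] for name in names)
    return first_line + "\n" + second_line + "\n"
```

Calling it with `{"categorical1": "Categorical_pjk", "gaussian1": "Gaussian_sjk", "z_class": "LatentClass"}` yields a variable line followed by a model line, in the same column order.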

### data.csv (learn and prediction)

#### Data format

• categorical data must be coded as contiguous integers, with the first modality coded as 1
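
If your categorical variable is stored as text labels, it has to be recoded before upload. Below is a minimal Python sketch of such a recoding; the function name `recode_categories` is hypothetical, the alphabetical ordering of the labels is an arbitrary choice, and the sketch assumes every value is observed (missing values are not handled).

```python
def recode_categories(values):
    """Map arbitrary labels to contiguous integers, the first modality being coded as 1."""
    levels = sorted(set(values))  # arbitrary but reproducible ordering of the modalities
    codes = {label: index for index, label in enumerate(levels, start=1)}
    return [codes[value] for value in values], codes
```

Keeping the returned `codes` dictionary makes it possible to translate the results back to the original labels afterwards.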

#### Example

data.csv
categorical1;categorical2;categorical3;gaussian1;gaussian2;gaussian3;poisson1;poisson2;poisson3;z_class
4;1;4;-0.3580979364;-0.3021767542;-0.1075462398;16;4;9;{2}
2;1;4;-0.3365602931;-0.1935100901;-0.2883606085;13;4;11;?
2;1;2;-0.2257433203;-0.3339290504;0.0209779879;21;7;14;2

#### Allowed missing value types for each model

 Categorical_pjk Gaussian_sjk Poisson_k LatentClass X X X X X X X X X

### param.ini (learn only)

This file will contain all the runtime parameters. At the moment, it only contains the requested number of classes, as the nbCluster parameter.

param.ini
nbCluster = 2
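
The following Python sketch writes and reads back such a file. The function names are chosen for this example. Because param.ini has no section header, the reader prepends a dummy `[root]` section so that the standard `configparser` module accepts it; this is a workaround on the reading side only and does not change the file that is uploaded.

```python
import configparser


def write_param_ini(nb_cluster):
    """Return the content of param.ini for the requested number of classes."""
    return f"nbCluster = {int(nb_cluster)}\n"


def read_nb_cluster(text):
    """Read the nbCluster value back from the content of a param.ini file."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the case of "nbCluster" instead of lowercasing keys
    parser.read_string("[root]\n" + text)  # dummy section header, not part of the file
    return parser.getint("root", "nbCluster")
```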

### output.RData (prediction only)

A file obtained as a result of a “learn” run, which contains a description of the estimated parameters used for the prediction. This file is in binary format and does not need to be edited by the user.

# Test case files

The syntax of the files must be respected: fields are delimited by ; and strings are not quoted.
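
Before uploading, these two rules can be checked locally. The Python sketch below is a hypothetical helper (`check_csv_syntax` is not part of the demonstrator) that reports quoted strings and lines whose number of fields differs from the first line. It only checks the layout, not the validity of the values themselves.

```python
def check_csv_syntax(text, delimiter=";"):
    """Return a list of layout problems found in the content of a ';'-delimited file."""
    problems = []
    expected = None  # number of fields, taken from the first non-empty line
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        if '"' in line or "'" in line:
            problems.append(f"line {number}: quotes are not allowed")
        fields = line.split(delimiter)
        if expected is None:
            expected = len(fields)
        elif len(fields) != expected:
            problems.append(f"line {number}: {len(fields)} fields, expected {expected}")
    return problems
```

An empty list means that no layout problem was detected.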

• Here is an archive containing three files that you can use to test the “learn” mode of the demonstrator: datalearn.zip. It contains a heterogeneous set of models (multinomial, Poisson and Gaussian).
• Here is an archive containing three files that you can use to test the “predict” mode of the demonstrator: datapredict.zip. The parameters in the output.RData file have been estimated from the learning set above.

# Output structure

The result is downloaded as an RData file containing a named list res. There is a hierarchy of elements. For example, if you want to access the parameters of the categorical1 data, you would do it as res$variable$param$categorical1$stat where you will find a table of the form:

| expectation | q 2.5% | q 97.5% |
|---|---|---|
| 0.3 | 0.25 | 0.35 |
| 0.7 | 0.69 | 0.71 |
| 0.6 | 0.54 | 0.63 |
| 0.4 | 0.35 | 0.41 |

If you look at the parameters for a gaussian variable, for example at res$variable$param$gaussian1$stat, you will find a table of the form:

| expectation | q 2.5% | q 97.5% |
|---|---|---|
| 3. | 2.9 | 3.1 |
| 0.7 | 0.69 | 0.75 |
| 4. | 3.95 | 4.1 |
| 0.4 | 0.25 | 0.56 |

Each table contains the various parameters of the model. The expectation and quantiles correspond to the estimation performed during the SEM algorithm. The row labels should be self-explanatory for the various types of models.

    res
      strategy
        nbTrialInInit
        nbBurnInIter
        nbIter
        nbGibbsBurnInIter
        nbGibbsIter
      mixture
        nbCluster
        nbFreeParameters
        lnObservedLikelihood
        lnSemiCompletedLikelihood
        lnCompletedLikelihood
        BIC
        ICL
        runTime
        nbSample
        warnLog
      variable
        data
          z_class
            completed   <- imputed classes
            stat        <- a posteriori distribution of class for each individual (= p(z_i | x_i))
          categorical1
            completed
            stat
          categorical2, etc ...
        param
          z_class
            stat        <- model proportions and quantiles
            log
          categorical1
            stat
            log
          categorical2, etc ...

Note that the z_class variable contains all the information pertaining to the latent classes:

• res$variable$data$z_class$completed contains the imputation for the class, $\hat{z}_i$
• res$variable$data$z_class$stat contains the estimated a posteriori probabilities, $\hat{t}_{ik}$
• res$variable$param$z_class$stat contains the proportions, $\hat{\pi}_k$
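
To make the relation between $\hat{t}_{ik}$ and a class assignment concrete, here is a small Python sketch that picks, for each individual, the class with the largest a posteriori probability. This is only an illustration: the probabilities in the usage note are made up, the function name is hypothetical, and the source does not state that MixtComp computes $\hat{z}_i$ exactly this way.

```python
def most_probable_classes(posteriors):
    """Return, for each row of probabilities, the 1-based index of the largest one."""
    # Ties are resolved in favour of the lowest class index.
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in posteriors]
```

With the made-up table `[[0.9, 0.1], [0.2, 0.8]]`, the first individual is assigned to class 1 and the second to class 2.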