Modal

MOdels for Data Analysis and Learning

© 2016-2018 Modal-Team. All rights reserved.

mixtcomp

Usage

The demonstrator is located at https://modal-research.lille.inria.fr/BigStat. To work on your own data, you first need to create an account. Usage is straightforward: you upload a zip archive containing three files, wait for the processing to finish, then download a zip file containing a .RData file that can be read by R. Note that this platform is currently in beta.

The three files to be uploaded in a zip file depend on the mode:

learn            predict
data.csv         data.csv
descriptor.csv   descriptor.csv
param.ini        output.RData
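
The packaging step can be sketched in Python as follows. This is a minimal sketch, not part of the demonstrator: the function name build_archive is hypothetical, and it assumes the three files must sit at the root of the archive, which this page does not state explicitly.

```python
import zipfile
from pathlib import Path

# File names required by each mode, as listed in the table above.
REQUIRED_FILES = {
    "learn": ("data.csv", "descriptor.csv", "param.ini"),
    "predict": ("data.csv", "descriptor.csv", "output.RData"),
}


def build_archive(mode, source_dir, archive_path):
    """Zip the three files required by `mode` and return the archive path."""
    if mode not in REQUIRED_FILES:
        raise ValueError(f"unknown mode {mode!r}, expected 'learn' or 'predict'")
    source_dir = Path(source_dir)
    missing = [
        name for name in REQUIRED_FILES[mode]
        if not (source_dir / name).is_file()
    ]
    if missing:
        raise FileNotFoundError("missing input files: " + ", ".join(missing))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in REQUIRED_FILES[mode]:
            # Assumption: files are stored at the archive root, without a folder.
            archive.write(source_dir / name, arcname=name)
    return Path(archive_path)
```

For example, build_archive("learn", "my_run", "learn.zip") would produce an archive ready for upload, and fail early if one of the three files is absent.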

Modes

The demonstrator has two modes of operation for MixtComp, “learn” and “predict”.

  • In learn mode, the parameters of the mixture are estimated.
  • In predict mode, only the missing values (including the latent class) are estimated, using the parameters estimated in a previous learn run.

Input format

In this section you will find a description of the syntax, and in a following section some examples of test files.

descriptor.csv (learn and prediction)

The descriptor file contains the names of the variables on its first line, and the names of the models to be applied to them on its second line. Currently three models can be applied:

  • Categorical_pjk
  • Gaussian_sjk
  • Poisson_k

If no information on the latent class is provided, the code runs in unsupervised mode. However, semi-supervised or fully supervised computations can be carried out by providing a z_class variable, whose model must then be LatentClass.

descriptor.csv
categorical1;categorical2;categorical3;gaussian1;gaussian2;gaussian3;poisson1;poisson2;poisson3;z_class
Categorical_pjk;Categorical_pjk;Categorical_pjk;Gaussian_sjk;Gaussian_sjk;Gaussian_sjk;Poisson_k;Poisson_k;Poisson_k;LatentClass
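
A descriptor file like the one above could be generated from a mapping of variable names to model names. The following Python sketch is illustrative only: the helper descriptor_text and the constant KNOWN_MODELS are hypothetical names, and the checks only cover the rules stated in this section.

```python
# Model names accepted in the second line of descriptor.csv (from this page).
KNOWN_MODELS = {"Categorical_pjk", "Gaussian_sjk", "Poisson_k", "LatentClass"}


def descriptor_text(models):
    """Return the content of descriptor.csv.

    `models` maps each variable name to its model name, in column order.
    """
    for variable, model in models.items():
        if model not in KNOWN_MODELS:
            raise ValueError(f"unknown model {model!r} for variable {variable!r}")
        if variable == "z_class" and model != "LatentClass":
            raise ValueError("the model of z_class must be LatentClass")
    names_line = ";".join(models.keys())
    models_line = ";".join(models.values())
    return names_line + "\n" + models_line + "\n"


if __name__ == "__main__":
    print(descriptor_text({
        "categorical1": "Categorical_pjk",
        "gaussian1": "Gaussian_sjk",
        "poisson1": "Poisson_k",
        "z_class": "LatentClass",
    }), end="")
```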

data.csv (learn and prediction)

Data format

  • categorical data must be coded as contiguous integers, with the first modality coded as 1

example

data.csv
categorical1;categorical2;categorical3;gaussian1;gaussian2;gaussian3;poisson1;poisson2;poisson3;z_class
4;1;4;-0.3580979364;-0.3021767542;-0.1075462398;16;4;9;{2}
2;1;4;-0.3365602931;-0.1935100901;-0.2883606085;13;4;11;?
2;1;2;-0.2257433203;-0.3339290504;0.0209779879;21;7;14;2
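
Raw categorical labels usually need to be recoded before they satisfy the rule above (contiguous integers starting at 1). A minimal Python sketch of such a recoding follows. The function recode_categorical is a hypothetical helper, the alphabetical ordering of labels is an arbitrary choice, and only the completely missing marker ? is passed through unchanged.

```python
def recode_categorical(values, missing="?"):
    """Recode labels as the strings "1", "2", ... and return (codes, mapping).

    Cells equal to `missing` are kept as they are. Labels are numbered in
    sorted order, which is an arbitrary but reproducible convention.
    """
    labels = sorted({value for value in values if value != missing})
    mapping = {label: str(index) for index, label in enumerate(labels, start=1)}
    recoded = [missing if value == missing else mapping[value] for value in values]
    return recoded, mapping


if __name__ == "__main__":
    column, mapping = recode_categorical(["red", "blue", "?", "red"])
    print(column)   # recoded column, ready to be written to data.csv
    print(mapping)  # keep this to translate the results back to labels
```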

allowed missing value types for each model

Missing value type                                   Categorical_pjk   Gaussian_sjk   Poisson_k   LatentClass
$?$ (completely missing)                             X                 X              X           X
$\{a,b,c\}$ (finite number of values authorized)     X                 -              -           X
$[a:b]$ (bounded interval)                           -                 X              -           -
$[-inf:b]$ (semi-bounded interval)                   -                 X              -           -
$[a:+inf]$ (semi-bounded interval)                   -                 X              -           -
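
The missing value syntaxes listed above can be recognized with a small parser, for instance to check a data.csv file before uploading it. The Python sketch below is an illustration, not MixtComp's own parser: classify_cell and its return labels are hypothetical, and it assumes that the values inside braces are integers and that interval bounds are written as plain decimal numbers, -inf or +inf.

```python
import math
import re

_FINITE_SET = re.compile(r"^\{(.+)\}$")
_INTERVAL = re.compile(r"^\[(.+):(.+)\]$")


def classify_cell(cell):
    """Return (kind, payload) describing one cell of data.csv.

    kind is one of "missing", "finite_set", "bounded", "semi_bounded"
    or "observed".
    """
    cell = cell.strip()
    if cell == "?":
        return ("missing", None)
    match = _FINITE_SET.match(cell)
    if match:
        # Assumption: authorized values are integer codes, as in {2}.
        return ("finite_set", [int(v) for v in match.group(1).split(",")])
    match = _INTERVAL.match(cell)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        if math.isinf(low) or math.isinf(high):
            return ("semi_bounded", (low, high))
        return ("bounded", (low, high))
    # Anything else is treated as a fully observed value and left as text.
    return ("observed", cell)
```

The function only looks at the syntax of a cell. Whether a given missing value type is accepted for a given model still has to be checked against the table above.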

param.ini (learn only)

This file will contain all the runtime parameters. At the moment, it only contains the requested number of classes, as the nbCluster parameter.

param.ini
nbCluster = 2
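
Since the file holds a single key = value line, it can be written and read back with a few lines of Python. This is a minimal sketch with hypothetical helper names (param_ini_text, read_nb_cluster). It does not use configparser because the example file above has no section header.

```python
def param_ini_text(nb_cluster):
    """Return the content of param.ini for the requested number of classes."""
    if not isinstance(nb_cluster, int) or nb_cluster < 1:
        raise ValueError("nbCluster must be a positive integer")
    return f"nbCluster = {nb_cluster}\n"


def read_nb_cluster(text):
    """Extract nbCluster from the content of a param.ini file."""
    for line in text.splitlines():
        key, separator, value = line.partition("=")
        if separator and key.strip() == "nbCluster":
            return int(value)
    raise KeyError("nbCluster")


if __name__ == "__main__":
    with open("param.ini", "w") as handle:
        handle.write(param_ini_text(2))
```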

output.RData (prediction only)

A file obtained as the result of a “learn” run, which contains a description of the estimated parameters used for the prediction. This file is in binary format and does not need to be edited by the user.

Test case files

The syntax of the files must be respected: fields are delimited by ; and strings are not quoted.

  • Here is an archive containing three files that you can use to test the “learn” mode of the demonstrator: datalearn.zip. It contains a heterogeneous set of models (multinomial, Poisson and Gaussian).
  • Here is an archive containing three files that you can use to test the “predict” mode of the demonstrator: datapredict.zip. The parameters in the output.RData file have been estimated from the learning set above.

Output structure

The result is downloaded as an RData file containing a named list res. There is a hierarchy of elements. For example, if you want to access the parameters of the categorical1 data, you would do it as res$variable$param$categorical1$stat where you will find a table of the form:

                      expectation   q 2.5%   q 97.5%
k: 1, modality: 1     0.3           0.25     0.35
k: 1, modality: 2     0.7           0.69     0.71
k: 2, modality: 1     0.6           0.54     0.63
k: 2, modality: 2     0.4           0.35     0.41

If you look at the parameters for a gaussian variable, for example at res$variable$param$gaussian1$stat, you will find a table of the form:

                      expectation   q 2.5%   q 97.5%
k: 1, mean            3.            2.9      3.1
k: 1, sd              0.7           0.69     0.75
k: 2, mean            4.            3.95     4.1
k: 2, sd              0.4           0.25     0.56

These tables contain the estimated parameters of each variable. The expectation and quantiles correspond to the estimation performed during the SEM algorithm. The row labels should be self-explanatory for the various types of models.

res
    strategy
        nbTrialInInit
        nbBurnInIter
        nbIter
        nbGibbsBurnInIter
        nbGibbsIter
    mixture
        nbCluster
        nbFreeParameters
        lnObservedLikelihood
        lnSemiCompletedLikelihood
        lnCompletedLikelihood
        BIC
        ICL
        runTime
        nbSample
        warnLog
    variable
        data
            z_class
                completed !!! <- imputed classes
                stat !!! <- a posteriori distribution of class for each individual (= p(z_i / x_i))
            categorical1
                completed
                stat
            categorical2, etc ...
        param
            z_class
                stat !!! <- model proportions and quantiles
                log
            categorical1
                stat
                log
            categorical2, etc ...

Note that the z_class variable contains all the information pertaining to the latent classes:

  • res$variable$data$z_class$completed contains the imputation for the class, $\hat{z}_i$
  • res$variable$data$z_class$stat contains the estimated a posteriori probabilities, $\hat{t}_{ik}$
  • res$variable$param$z_class$stat contains the proportions, $\hat{\pi}_k$
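
For reference, the three quantities above are linked by the usual mixture model relation, written here as a sketch. The symbols $f_k$ (density of class $k$), $\hat{\theta}_k$ (its estimated parameters) and $K$ (the number of classes, nbCluster) are notation introduced for this illustration and do not appear elsewhere on this page. The values stored in the output are estimates produced by the SEM algorithm, so they need not match this closed form exactly.

```latex
$$
\hat{t}_{ik}
  = \frac{\hat{\pi}_k \, f_k(x_i ; \hat{\theta}_k)}
         {\sum_{l=1}^{K} \hat{\pi}_l \, f_l(x_i ; \hat{\theta}_l)},
\qquad
\sum_{k=1}^{K} \hat{t}_{ik} = 1
$$
```

Each row of the a posteriori probability table therefore sums to 1, and an individual whose largest $\hat{t}_{ik}$ is close to 1 is assigned to its class with little ambiguity.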

Complementary information

mixtcomp.txt · Last modified: 2015/07/03 13:12 by kubicki