(maxima.info)Functions and Variables for data manipulation


50.2 Functions and Variables for data manipulation
==================================================

 -- Function: build_sample
          build_sample (<list>)
          build_sample (<matrix>)

     Builds a sample from a table of absolute frequencies.  The input
     table can be a matrix or a list of lists, all of them of equal
     size.  The number of columns or the length of the lists must be
     greater than 1.  The last element of each row or list is
     interpreted as the absolute frequency.  The output is always a
     sample in matrix form.

     Examples:

     Univariate frequency table.

          (%i1) load ("descriptive")$
          (%i2) sam1: build_sample([[6,1], [j,2], [2,1]]);
                                 [ 6 ]
                                 [   ]
                                 [ j ]
          (%o2)                  [   ]
                                 [ j ]
                                 [   ]
                                 [ 2 ]
          (%i3) mean(sam1);
                                2 j + 8
          (%o3)                [-------]
                                   4
          (%i4) barsplot(sam1) $

     Multivariate frequency table.

          (%i1) load ("descriptive")$
          (%i2) sam2: build_sample([[6,3,1], [5,6,2], [u,2,1],[6,8,2]]) ;
                                     [ 6  3 ]
                                     [      ]
                                     [ 5  6 ]
                                     [      ]
                                     [ 5  6 ]
          (%o2)                      [      ]
                                     [ u  2 ]
                                     [      ]
                                     [ 6  8 ]
                                     [      ]
                                     [ 6  8 ]
          (%i3) cov(sam2);
                 [   2                 2                            ]
                 [  u  + 158   (u + 28)     2 u + 174   11 (u + 28) ]
                 [  -------- - ---------    --------- - ----------- ]
          (%o3)  [     6          36            6           12      ]
                 [                                                  ]
                 [ 2 u + 174   11 (u + 28)            21            ]
                 [ --------- - -----------            --            ]
                 [     6           12                 4             ]
          (%i4) barsplot(sam2, grouping=stacked) $

 -- Function: continuous_freq
          continuous_freq (<data>)
          continuous_freq (<data>, <m>)

     The first argument of 'continuous_freq' must be a list or
     1-dimensional array (as created by 'make_array') of numbers.
     Divides the range in intervals and counts how many values are
     inside them.  The second argument is optional and either equals the
     number of classes we want, 10 by default, or equals a list
     containing the class limits and the number of classes we want, or a
     list containing only the limits.

     If sample values are all equal, this function returns only one
     class of amplitude 2.

     Examples:

     Optional argument indicates the number of classes we want.  The
     first list in the output contains the interval limits, and the
     second the corresponding counts: there are 16 digits inside the
     interval '[0, 1.8]', 24 digits in '(1.8, 3.6]', and so on.

          (%i1) load ("descriptive")$
          (%i2) s1 : read_list (file_search ("pidigits.data"))$
          (%i3) continuous_freq (s1, 5);
          (%o3) [[0, 1.8, 3.6, 5.4, 7.2, 9.0], [16, 24, 18, 17, 25]]

     Optional argument indicates we want 7 classes with limits -2 and
     12:

          (%i1) load ("descriptive")$
          (%i2) s1 : read_list (file_search ("pidigits.data"))$
          (%i3) continuous_freq (s1, [-2,12,7]);
          (%o3) [[- 2, 0, 2, 4, 6, 8, 10, 12], [8, 20, 22, 17, 20, 13, 0]]

     Optional argument indicates we want the default number of classes
     with limits -2 and 12:

          (%i1) load ("descriptive")$
          (%i2) s1 : read_list (file_search ("pidigits.data"))$
          (%i3) continuous_freq (s1, [-2,12]);
                          3  4  11  18     32  39  46  53
          (%o3)  [[- 2, - -, -, --, --, 5, --, --, --, --, 12],
                          5  5  5   5      5   5   5   5
                         [0, 8, 20, 12, 18, 9, 8, 25, 0, 0]]

     The first argument may be an array.

          (%i1) load ("descriptive")$
          (%i2) s1 : read_list (file_search ("pidigits.data"))$
          (%i3) a1 : make_array (fixnum, length (s1)) $
          (%i4) fillarray (a1, s1);
          (%o4) {Lisp Array:
          #(3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 2 6 4 3 3 8 3 2 7 9 \
          5 0 2 8 8 4 1 9 7 1 6 9 3 9 9 3 7 5 1 0 5 8 2 0 9 7 4 9 4 4 5 9
            2 3 0 7 8 1 6 4 0 6 2 8 6 2 0 8 9 9 8 6 2 8 0 3 4 8 2 5 3 4 2 \
          1 1 7 0 6 7)}
          (%i5) continuous_freq (a1);
                     9   9  27  18  9  27  63  36  81
          (%o5) [[0, --, -, --, --, -, --, --, --, --, 9],
                     10  5  10  5   2  5   10  5   10
                                       [8, 8, 12, 12, 10, 8, 9, 8, 12, 13]]

 -- Function: discrete_freq (<data>)
     Counts absolute frequencies in discrete samples, both numeric and
     categorical.  Its unique argument is a list, or 1-dimensional array
     (as created by 'make_array').

          (%i1) load ("descriptive")$
          (%i2) s1 : read_list (file_search ("pidigits.data"))$
          (%i3) discrete_freq (s1);
          (%o3) [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                                       [8, 8, 12, 12, 10, 8, 9, 8, 12, 13]]

     The first list gives the sample values and the second their
     absolute frequencies.  Commands '? col' and '? transpose' should
     help you to understand the last input.

     The argument may be an array.

          (%i1) load ("descriptive")$
          (%i2) s1 : read_list (file_search ("pidigits.data"))$
          (%i3) a1 : make_array (fixnum, length (s1)) $
          (%i4) fillarray (a1, s1);
          (%o4) {Lisp Array:
          #(3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 2 6 4 3 3 8 3 2 7 9 \
          5 0 2 8 8 4 1 9 7 1 6 9 3 9 9 3 7 5 1 0 5 8 2 0 9 7 4 9 4 4 5 9
            2 3 0 7 8 1 6 4 0 6 2 8 6 2 0 8 9 9 8 6 2 8 0 3 4 8 2 5 3 4 2 \
          1 1 7 0 6 7)}
          (%i5) discrete_freq (a1);
          (%o5) [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                                       [8, 8, 12, 12, 10, 8, 9, 8, 12, 13]]

 -- Function: standardize
          standardize (<list>)
          standardize (<matrix>)

     Subtracts to each element of the list the sample mean and divides
     the result by the standard deviation.  When the input is a matrix,
     'standardize' subtracts to each row the multivariate mean, and then
     divides each component by the corresponding standard deviation.

 -- Function: subsample
          subsample (<data_matrix>, <predicate_function>)
          subsample (<data_matrix>, <predicate_function>, <col_num1>,
          <col_num2>, ...)

     This is a sort of variant of the Maxima 'submatrix' function.  The
     first argument is the data matrix, the second is a predicate
     function and optional additional arguments are the numbers of the
     columns to be taken.  Its behaviour is better understood with
     examples.

     These are multivariate records in which the wind speed in the first
     meteorological station were greater than 18.  See that in the
     lambda expression the <i>-th component is referred to as 'v[i]'.
          (%i1) load ("descriptive")$
          (%i2) s2 : read_matrix (file_search ("wind.data"))$
          (%i3) subsample (s2, lambda([v], v[1] > 18));
                        [ 19.38  15.37  15.12  23.09  25.25 ]
                        [                                   ]
                        [ 18.29  18.66  19.08  26.08  27.63 ]
          (%o3)         [                                   ]
                        [ 20.25  21.46  19.95  27.71  23.38 ]
                        [                                   ]
                        [ 18.79  18.96  14.46  26.38  21.84 ]

     In the following example, we request only the first, second and
     fifth components of those records with wind speeds greater or equal
     than 16 in station number 1 and less than 25 knots in station
     number 4.  The sample contains only data from stations 1, 2 and 5.
     In this case, the predicate function is defined as an ordinary
     Maxima function.
          (%i1) load ("descriptive")$
          (%i2) s2 : read_matrix (file_search ("wind.data"))$
          (%i3) g(x):= x[1] >= 16 and x[4] < 25$
          (%i4) subsample (s2, g, 1, 2, 5);
                               [ 19.38  15.37  25.25 ]
                               [                     ]
                               [ 17.33  14.67  19.58 ]
          (%o4)                [                     ]
                               [ 16.92  13.21  21.21 ]
                               [                     ]
                               [ 17.25  18.46  23.87 ]

     Here is an example with the categorical variables of 'biomed.data'.
     We want the records corresponding to those patients in group 'B'
     who are older than 38 years.
          (%i1) load ("descriptive")$
          (%i2) s3 : read_matrix (file_search ("biomed.data"))$
          (%i3) h(u):= u[1] = B and u[2] > 38 $
          (%i4) subsample (s3, h);
                          [ B  39  28.0  102.3  17.1  146 ]
                          [                               ]
                          [ B  39  21.0  92.4   10.3  197 ]
                          [                               ]
                          [ B  39  23.0  111.5  10.0  133 ]
                          [                               ]
                          [ B  39  26.0  92.6   12.3  196 ]
          (%o4)           [                               ]
                          [ B  39  25.0  98.7   10.0  174 ]
                          [                               ]
                          [ B  39  21.0  93.2   5.9   181 ]
                          [                               ]
                          [ B  39  18.0  95.0   11.3  66  ]
                          [                               ]
                          [ B  39  39.0  88.5   7.6   168 ]

     Probably, the statistical analysis will involve only the blood
     measures,
          (%i1) load ("descriptive")$
          (%i2) s3 : read_matrix (file_search ("biomed.data"))$
          (%i3) subsample (s3, lambda([v], v[1] = B and v[2] > 38),
                           3, 4, 5, 6);
                             [ 28.0  102.3  17.1  146 ]
                             [                        ]
                             [ 21.0  92.4   10.3  197 ]
                             [                        ]
                             [ 23.0  111.5  10.0  133 ]
                             [                        ]
                             [ 26.0  92.6   12.3  196 ]
          (%o3)              [                        ]
                             [ 25.0  98.7   10.0  174 ]
                             [                        ]
                             [ 21.0  93.2   5.9   181 ]
                             [                        ]
                             [ 18.0  95.0   11.3  66  ]
                             [                        ]
                             [ 39.0  88.5   7.6   168 ]

     This is the multivariate mean of 's3',
          (%i1) load ("descriptive")$
          (%i2) s3 : read_matrix (file_search ("biomed.data"))$
          (%i3) mean (s3);
                 65 B + 35 A  317          6 NA + 8144.999999999999
          (%o3) [-----------, ---, 87.178, ------------------------,
                     100      10                     100
                                                              3 NA + 19587
                                                      18.123, ------------]
                                                                  100

     Here, the first component is meaningless, since 'A' and 'B' are
     categorical, the second component is the mean age of individuals in
     rational form, and the fourth and last values exhibit some strange
     behaviour.  This is because symbol 'NA' is used here to indicate
     <non available> data, and the two means are nonsense.  A possible
     solution would be to take out from the matrix those rows with 'NA'
     symbols, although this deserves some loss of information.
          (%i1) load ("descriptive")$
          (%i2) s3 : read_matrix (file_search ("biomed.data"))$
          (%i3) g(v):= v[4] # NA and v[6] # NA $
          (%i4) mean (subsample (s3, g, 3, 4, 5, 6));
          (%o4) [79.4923076923077, 86.2032967032967, 16.93186813186813,
                                                                      2514
                                                                      ----]
                                                                       13

 -- Function: transform_sample (<matrix>, <varlist>, <exprlist>)

     Transforms the sample <matrix>, where each column is called
     according to <varlist>, following expressions in <exprlist>.

     Examples:

     The second argument assigns names to the three columns.  With these
     names, a list of expressions define the transformation of the
     sample.

          (%i1) load ("descriptive")$
          (%i2) data: matrix([3,2,7],[3,7,2],[8,2,4],[5,2,4]) $
          (%i3) transform_sample(data, [a,b,c], [c, a*b, log(a)]);
                                         [ 7  6   log(3) ]
                                         [               ]
                                         [ 2  21  log(3) ]
          (%o3)                          [               ]
                                         [ 4  16  log(8) ]
                                         [               ]
                                         [ 4  10  log(5) ]

     Add a constant column and remove the third variable.

          (%i1) load ("descriptive")$
          (%i2) data: matrix([3,2,7],[3,7,2],[8,2,4],[5,2,4]) $
          (%i3) transform_sample(data, [a,b,c], [makelist(1,k,length(data)),a,b]);
                                            [ 1  3  2 ]
                                            [         ]
                                            [ 1  3  7 ]
          (%o3)                             [         ]
                                            [ 1  8  2 ]
                                            [         ]
                                            [ 1  5  2 ]
automatically generated by info2www version 1.2.2.9