50.2 Functions and Variables for data manipulation
==================================================
-- Function: build_sample
build_sample (<list>)
build_sample (<matrix>)
Builds a sample from a table of absolute frequencies. The input
table can be a matrix or a list of lists, all of them of equal
size. The number of columns or the length of the lists must be
greater than 1. The last element of each row or list is
interpreted as the absolute frequency. The output is always a
sample in matrix form.
Examples:
Univariate frequency table.
(%i1) load ("descriptive")$
(%i2) sam1: build_sample([[6,1], [j,2], [2,1]]);
[ 6 ]
[ ]
[ j ]
(%o2) [ ]
[ j ]
[ ]
[ 2 ]
(%i3) mean(sam1);
       2 j + 8
(%o3) [-------]
          4
(%i4) barsplot(sam1) $
Multivariate frequency table.
(%i1) load ("descriptive")$
(%i2) sam2: build_sample([[6,3,1], [5,6,2], [u,2,1],[6,8,2]]) ;
[ 6 3 ]
[ ]
[ 5 6 ]
[ ]
[ 5 6 ]
(%o2) [ ]
[ u 2 ]
[ ]
[ 6 8 ]
[ ]
[ 6 8 ]
(%i3) cov(sam2);
      [   2                 2                            ]
      [  u  + 158   (u + 28)     2 u + 174   11 (u + 28) ]
      [  -------- - ---------    --------- - ----------- ]
(%o3) [     6           36           6            12     ]
      [                                                  ]
      [ 2 u + 174   11 (u + 28)             21           ]
      [ --------- - -----------             --           ]
      [     6            12                 4            ]
(%i4) barsplot(sam2, grouping=stacked) $
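As a quick sanity check on the description above, the following
sketch uses a small, made-up frequency table ('tbl' and 'sam' are
hypothetical names, not part of the package) and compares the number
of rows of the sample with the sum of the frequencies. The two should
agree, so the last line is expected to evaluate to 'true'.

```maxima
load ("descriptive")$
/* Hypothetical table: the value 10 once, 20 three times, 30 twice. */
tbl : [[10, 1], [20, 3], [30, 2]]$
sam : build_sample (tbl)$
/* 'length' of a matrix is its number of rows; the last element of
   each inner list is the absolute frequency. */
is (length (sam) = apply ("+", map (last, tbl)));
```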
-- Function: continuous_freq
continuous_freq (<data>)
continuous_freq (<data>, <m>)
The first argument of 'continuous_freq' must be a list or
1-dimensional array (as created by 'make_array') of numbers. The
function divides the range into intervals and counts how many values
fall inside each one. The second argument is optional and is either
the number of classes wanted (10 by default), a list containing the
class limits and the number of classes wanted, or a list containing
only the limits.
If all sample values are equal, this function returns only one
class of amplitude (width) 2.
Examples:
Optional argument indicates the number of classes we want. The
first list in the output contains the interval limits, and the
second the corresponding counts: there are 16 digits inside the
interval '[0, 1.8]', 24 digits in '(1.8, 3.6]', and so on.
(%i1) load ("descriptive")$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) continuous_freq (s1, 5);
(%o3) [[0, 1.8, 3.6, 5.4, 7.2, 9.0], [16, 24, 18, 17, 25]]
Optional argument indicates we want 7 classes with limits -2 and
12:
(%i1) load ("descriptive")$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) continuous_freq (s1, [-2,12,7]);
(%o3) [[- 2, 0, 2, 4, 6, 8, 10, 12], [8, 20, 22, 17, 20, 13, 0]]
Optional argument indicates we want the default number of classes
with limits -2 and 12:
(%i1) load ("descriptive")$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) continuous_freq (s1, [-2,12]);
               3  4  11  18     32  39  46  53
(%o3) [[- 2, - -, -, --, --, 5, --, --, --, --, 12],
               5  5  5   5      5   5   5   5
       [0, 8, 20, 12, 18, 9, 8, 25, 0, 0]]
The first argument may be an array.
(%i1) load ("descriptive")$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) a1 : make_array (fixnum, length (s1)) $
(%i4) fillarray (a1, s1);
(%o4) {Lisp Array:
#(3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 2 6 4 3 3 8 3 2 7 9 \
5 0 2 8 8 4 1 9 7 1 6 9 3 9 9 3 7 5 1 0 5 8 2 0 9 7 4 9 4 4 5 9
2 3 0 7 8 1 6 4 0 6 2 8 6 2 0 8 9 9 8 6 2 8 0 3 4 8 2 5 3 4 2 \
1 1 7 0 6 7)}
(%i5) continuous_freq (a1);
           9   9  27  18  9  27  63  36  81
(%o5) [[0, --, -, --, --, -, --, --, --, --, 9],
           10  5  10  5   2  5   10  5   10
       [8, 8, 12, 12, 10, 8, 9, 8, 12, 13]]
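The forms of the second argument can also be tried on a small sample
whose counts are easy to verify by hand. This is only a sketch: the
list 'x' and the name 'res' are made up for illustration. Since every
value of 'x' lies between the requested limits, the counts should add
up to the sample size.

```maxima
load ("descriptive")$
x : [1, 2, 2, 3, 7, 8, 10]$        /* hypothetical sample */
/* Two classes between the limits 0 and 10. */
res : continuous_freq (x, [0, 10, 2]);
/* All values lie inside [0, 10], so the counts should add up to
   the sample size. */
is (apply ("+", second (res)) = length (x));
/* A constant sample: per the description, a single class is returned. */
continuous_freq ([5, 5, 5]);
```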
-- Function: discrete_freq (<data>)
Counts absolute frequencies in discrete samples, both numeric and
categorical. Its only argument is a list or a 1-dimensional array
(as created by 'make_array').
(%i1) load ("descriptive")$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) discrete_freq (s1);
(%o3) [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[8, 8, 12, 12, 10, 8, 9, 8, 12, 13]]
The first list gives the sample values and the second their
absolute frequencies.
The argument may be an array.
(%i1) load ("descriptive")$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) a1 : make_array (fixnum, length (s1)) $
(%i4) fillarray (a1, s1);
(%o4) {Lisp Array:
#(3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4 6 2 6 4 3 3 8 3 2 7 9 \
5 0 2 8 8 4 1 9 7 1 6 9 3 9 9 3 7 5 1 0 5 8 2 0 9 7 4 9 4 4 5 9
2 3 0 7 8 1 6 4 0 6 2 8 6 2 0 8 9 9 8 6 2 8 0 3 4 8 2 5 3 4 2 \
1 1 7 0 6 7)}
(%i5) discrete_freq (a1);
(%o5) [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[8, 8, 12, 12, 10, 8, 9, 8, 12, 13]]
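Because the result is a pair of lists, relative frequencies can be
derived with ordinary list arithmetic. The following is a minimal
sketch on a made-up categorical sample; the names 's', 'f' and 'rel'
are hypothetical.

```maxima
load ("descriptive")$
s : [red, blue, red, green, red]$   /* hypothetical categorical sample */
f : discrete_freq (s)$
/* first(f): distinct values; second(f): their absolute frequencies.
   Dividing a list by a number acts element by element. */
rel : second (f) / length (s);
/* Pair each value with its relative frequency. */
map ("[", first (f), rel);
```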
-- Function: standardize
standardize (<list>)
standardize (<matrix>)
Subtracts the sample mean from each element of the list and divides
the result by the standard deviation. When the input is a matrix,
'standardize' subtracts the multivariate mean from each row, and then
divides each component by the corresponding standard deviation.
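The operation can be compared with a manual computation. The sketch
below uses a made-up list 'x'. It assumes that the standard deviation
involved is the one returned by 'std' (divisor n) rather than 'std1'
(divisor n-1); the text above does not say which, so treat the manual
line as an illustration rather than a definition.

```maxima
load ("descriptive")$
x : [2, 4, 4, 4, 5, 5, 7, 9]$       /* hypothetical sample */
z : standardize (x);
/* Manual version. ASSUMPTION: the standard deviation used is 'std'
   (divisor n); replace it with 'std1' if the divisor n-1 is intended. */
z_manual : (x - mean (x)) / std (x);
/* A standardized sample should have mean 0. */
mean (z);
```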
-- Function: subsample
subsample (<data_matrix>, <predicate_function>)
subsample (<data_matrix>, <predicate_function>, <col_num1>,
<col_num2>, ...)
This is a sort of variant of the Maxima 'submatrix' function. The
first argument is the data matrix, the second is a predicate
function and optional additional arguments are the numbers of the
columns to be taken. Its behaviour is better understood with
examples.
These are the multivariate records in which the wind speed at the
first meteorological station was greater than 18. Note that in the
lambda expression the <i>-th component is referred to as 'v[i]'.
(%i1) load ("descriptive")$
(%i2) s2 : read_matrix (file_search ("wind.data"))$
(%i3) subsample (s2, lambda([v], v[1] > 18));
[ 19.38 15.37 15.12 23.09 25.25 ]
[ ]
[ 18.29 18.66 19.08 26.08 27.63 ]
(%o3) [ ]
[ 20.25 21.46 19.95 27.71 23.38 ]
[ ]
[ 18.79 18.96 14.46 26.38 21.84 ]
In the following example, we request only the first, second and
fifth components of those records with wind speeds greater than or
equal to 16 in station number 1 and less than 25 knots in station
number 4. The sample contains only data from stations 1, 2 and 5.
In this case, the predicate function is defined as an ordinary
Maxima function.
(%i1) load ("descriptive")$
(%i2) s2 : read_matrix (file_search ("wind.data"))$
(%i3) g(x):= x[1] >= 16 and x[4] < 25$
(%i4) subsample (s2, g, 1, 2, 5);
[ 19.38 15.37 25.25 ]
[ ]
[ 17.33 14.67 19.58 ]
(%o4) [ ]
[ 16.92 13.21 21.21 ]
[ ]
[ 17.25 18.46 23.87 ]
Here is an example with the categorical variables of 'biomed.data'.
We want the records corresponding to those patients in group 'B'
who are older than 38 years.
(%i1) load ("descriptive")$
(%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) h(u):= u[1] = B and u[2] > 38 $
(%i4) subsample (s3, h);
[ B 39 28.0 102.3 17.1 146 ]
[ ]
[ B 39 21.0 92.4 10.3 197 ]
[ ]
[ B 39 23.0 111.5 10.0 133 ]
[ ]
[ B 39 26.0 92.6 12.3 196 ]
(%o4) [ ]
[ B 39 25.0 98.7 10.0 174 ]
[ ]
[ B 39 21.0 93.2 5.9 181 ]
[ ]
[ B 39 18.0 95.0 11.3 66 ]
[ ]
[ B 39 39.0 88.5 7.6 168 ]
The statistical analysis will probably involve only the blood
measurements:
(%i1) load ("descriptive")$
(%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) subsample (s3, lambda([v], v[1] = B and v[2] > 38),
3, 4, 5, 6);
[ 28.0 102.3 17.1 146 ]
[ ]
[ 21.0 92.4 10.3 197 ]
[ ]
[ 23.0 111.5 10.0 133 ]
[ ]
[ 26.0 92.6 12.3 196 ]
(%o3) [ ]
[ 25.0 98.7 10.0 174 ]
[ ]
[ 21.0 93.2 5.9 181 ]
[ ]
[ 18.0 95.0 11.3 66 ]
[ ]
[ 39.0 88.5 7.6 168 ]
This is the multivariate mean of 's3':
(%i1) load ("descriptive")$
(%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) mean (s3);
       65 B + 35 A  317          6 NA + 8144.999999999999
(%o3) [-----------, ---, 87.178, ------------------------,
           100      10                     100
               3 NA + 19587
       18.123, ------------]
                   100
Here, the first component is meaningless, since 'A' and 'B' are
categorical; the second component is the mean age of individuals in
rational form; and the fourth and last values exhibit some strange
behaviour. This is because the symbol 'NA' is used here to indicate
data that are not available, so these two means are nonsense. A
possible solution is to remove from the matrix those rows containing
'NA' symbols, although this entails some loss of information.
(%i1) load ("descriptive")$
(%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) g(v):= v[4] # NA and v[6] # NA $
(%i4) mean (subsample (s3, g, 3, 4, 5, 6));
(%o4) [79.4923076923077, 86.2032967032967, 16.93186813186813,
       2514
       ----]
        13
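The same pattern can be tried without the data files. This is a
minimal sketch on a made-up matrix 'm' (a hypothetical name), showing
the call with and without the optional column numbers.

```maxima
load ("descriptive")$
/* Hypothetical data: an id, a measurement and a group label. */
m : matrix ([1, 10, a], [5, 20, b], [9, 30, a])$
/* Keep the rows whose first entry exceeds 3 ... */
subsample (m, lambda ([v], v[1] > 3));
/* ... and the same rows restricted to columns 2 and 3. */
subsample (m, lambda ([v], v[1] > 3), 2, 3);
```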
-- Function: transform_sample (<matrix>, <varlist>, <exprlist>)
Transforms the sample <matrix>, where each column is named
according to <varlist>, following the expressions in <exprlist>.
Examples:
The second argument assigns names to the three columns. With these
names, a list of expressions defines the transformation of the
sample.
(%i1) load ("descriptive")$
(%i2) data: matrix([3,2,7],[3,7,2],[8,2,4],[5,2,4]) $
(%i3) transform_sample(data, [a,b,c], [c, a*b, log(a)]);
[ 7 6 log(3) ]
[ ]
[ 2 21 log(3) ]
(%o3) [ ]
[ 4 16 log(8) ]
[ ]
[ 4 10 log(5) ]
Add a constant column and remove the third variable.
(%i1) load ("descriptive")$
(%i2) data: matrix([3,2,7],[3,7,2],[8,2,4],[5,2,4]) $
(%i3) transform_sample(data, [a,b,c], [makelist(1,k,length(data)),a,b]);
[ 1 3 2 ]
[ ]
[ 1 3 7 ]
(%o3) [ ]
[ 1 8 2 ]
[ ]
[ 1 5 2 ]
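A common use is to append a derived variable while keeping the
original columns. The sketch below is only illustrative: the data,
the column names 'h' and 'w', and the derived ratio are made up.

```maxima
load ("descriptive")$
/* Hypothetical data: height in cm and weight in kg. */
d : matrix ([150, 50], [180, 81], [200, 100])$
/* Keep both columns and append weight / (height in m)^2. */
transform_sample (d, [h, w], [h, w, w / (h / 100)^2]);
```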