Using Subsets

Retrieving subsets#

Any subset defined for a Data can be retrieved, or referred to when calling methods that use subsets (e.g., data.subset_data(), data.plots()), by its name and, if necessary, the name of the group it belongs to. Use the get_subsets() method to retrieve a subset by its “path” (i.e., the combination of a group name and a subset name). For example, to get the subset named “all” in the “default” group, use:

s = data.get_subsets('default/all') # path is <group_name>/<subset_name>

For subsets in the “default” group, default/ can be omitted:

s = data.get_subsets('all') # retrieving from the "default" group

You can also provide group and subset names separately. For example:

data.get_subsets(name='all', group='default') # subset 'all' in group 'default'
data.get_subsets(name='my_subset') # looks in group 'default' first; if 'my_subset' is not found there, the other groups are searched
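
The lookup order described above can be sketched without PyTTOP itself. The snippet below is only an illustration under stated assumptions, not the library's implementation: `resolve_subset` and the nested-dict `registry` are hypothetical names, the path is assumed to be split at its last `/`, and a bare path is assumed to refer to the "default" group only.

```python
def resolve_subset(registry, path=None, name=None, group=None):
    """Look up a subset in a {group: {name: subset}} mapping (hypothetical layout)."""
    if path is not None:
        # '<group_name>/<subset_name>'; a bare '<subset_name>' means group 'default'
        group, _, name = path.rpartition('/')
        group = group or 'default'
        return registry[group][name]
    if group is not None:
        return registry[group][name]
    # No group given: try 'default' first, then search the other groups
    if name in registry.get('default', {}):
        return registry['default'][name]
    for subsets in registry.values():
        if name in subsets:
            return subsets[name]
    raise KeyError(f"subset {name!r} not found in any group")


# Strings stand in for real subset objects in this sketch
registry = {
    'default': {'all': 'subset-all'},
    'mygroup': {'mysubset': 'subset-mine'},
}
print(resolve_subset(registry, path='default/all'))
print(resolve_subset(registry, path='all'))
print(resolve_subset(registry, name='mysubset'))
```

The last call finds 'mysubset' in 'mygroup' because the name is absent from 'default', which mirrors the fallback search described for get_subsets(name=...).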

If a subset object is passed instead of a name or path, that same object is returned:

data.get_subsets(my_subset) # returns `my_subset` itself

Multiple subsets can be retrieved simultaneously by providing a list (or lists):

data.get_subsets(['default/all', 'mygroup/mysubset'])
data.get_subsets(name=['all', 'mysubset'])
data.get_subsets(name=['all', 'mysubset'], group='default')

Multiple subsets are returned as a list.

Special subsets#

A “special subset” is a virtual subset that is created temporarily when it is retrieved (or referred to). It is not considered a real subset of a Data (and thus does not appear in data.subset_summary()) unless it is explicitly added to the Data using data.add_subsets().

A special subset can only be retrieved with its “path” (see retrieving subsets). For example:

data.get_subsets('<special subset path>') # allowed
data.get_subsets(path='<special subset path>') # allowed
data.get_subsets(name='<special subset path>') # ERROR

Tip

“Retrieving” a special subset actually creates a new subset, which can then be used like any existing subset of the Data. This is useful when making plots: you can specify a special subset (e.g., using the paths argument of data.plots()) without defining it in advance, and it will be created automatically (and temporarily).

`$unmasked`#

$unmasked:<col_name> is a special subset that selects the rows whose value in the column named <col_name> is not masked. For example:

d = Data(name='example')
d['x'] = [1, 2, -99, 4, -99]
d.mask_missing('x', -99) # values -99 in column 'x' are masked
print(d.t)
s = d.get_subsets('$unmasked:x')
print(s)
print(s.selection) # s.selection is a boolean array indicating whether each row is included in this subset

[mask missing] col 'x': 2/5 (40.00%) masked (value: -99).
 x 
   
---
  1
  2
 --
  4
 --
<Subset '$unmasked(x)' of Data 'example' (3/5)>
[ True  True False  True False]

As seen, the subset '$unmasked:x' includes the three items in column 'x' that are not masked.
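
To make the meaning of the selection explicit, the same boolean array can be reproduced in plain Python. This is only a sketch of the idea and does not use PyTTOP: here the sentinel -99 is compared directly, whereas the library works on the column's mask.

```python
MISSING = -99  # sentinel value that was masked in the example
values = [1, 2, -99, 4, -99]

# True where the value is present (not masked), False where it is missing
selection = [v != MISSING for v in values]

print(selection)
print(f"{sum(selection)}/{len(selection)} rows selected")
```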

Note

In PyTTOP 0.4.3 and older, the path had the form $unmasked/<col_name>. This form is still supported for compatibility, but may be deprecated in future releases.

`$eval`#

$eval:<expr> is a special subset that evaluates the expression <expr>, similar to using a subset defined as Subset(<expr>) (see details here). For example:

d['x'] = [1, 2, 3, 4, 5]
s = d.get_subsets('$eval:x > 3')
print(s)
print(s.selection) # s.selection is a boolean array indicating whether each row is included in this subset

<Subset 'x > 3' of Data 'example' (2/5)>
[False False False  True  True]
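
Conceptually, an expression subset turns a string such as 'x > 3' into one boolean per row. The sketch below shows that idea in plain Python under loud assumptions: `eval_selection` is a hypothetical helper, it evaluates the expression row by row with Python's built-in eval, and it says nothing about how PyTTOP evaluates expressions internally.

```python
def eval_selection(expr, columns):
    """Evaluate `expr` once per row, with column names bound to that row's values."""
    n_rows = len(next(iter(columns.values())))
    code = compile(expr, '<expr>', 'eval')
    selection = []
    for i in range(n_rows):
        row = {name: col[i] for name, col in columns.items()}
        # Built-ins are removed so that only column names are visible to the expression
        selection.append(bool(eval(code, {'__builtins__': {}}, row)))
    return selection


columns = {'x': [1, 2, 3, 4, 5]}
print(eval_selection('x > 3', columns))
```

Only pass trusted expressions to a helper like this; eval is not a sandbox even with built-ins removed.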

Subset calculations#

For any two subsets of the same Data, the intersection and union can be computed with the operators & and |; the complement of a single subset is computed with ~.

# sub1, sub2 are two subsets
sub1 & sub2 # intersection: rows in both `sub1` AND `sub2`
sub1 | sub2 # union: rows in either `sub1` OR `sub2`
~sub1 # complement: rows NOT in `sub1`
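
Since each subset carries a boolean selection with one entry per row, these operators correspond to element-wise logic on the selections. The following plain-Python sketch illustrates the semantics with two made-up selections; it does not call PyTTOP.

```python
# Two hypothetical selections over the same five rows
sub1 = [True, True, False, False, True]
sub2 = [True, False, True, False, True]

intersection = [a and b for a, b in zip(sub1, sub2)]  # like sub1 & sub2
union = [a or b for a, b in zip(sub1, sub2)]          # like sub1 | sub2
complement = [not a for a in sub1]                    # like ~sub1

print(intersection)
print(union)
print(complement)
```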

Cutting data given subsets#

The subset_data() method (or its alias, subdat()) returns a cut of the data that contains only the rows in the specified subset. For example:

d['name'] = ['Alice', 'Bob', 'Carol', 'Dave', 'Eve']
d.add_subsets(Subset([True, False, True, False, True], name='good_guy'))
cut_d = d.subset_data('good_guy')
cut_d.t

Table masked=True length=3
  x    name
int64  str5
    1 Alice
    3 Carol
    5   Eve

The rules for referring to subsets in subset_data() are identical to those for Retrieving subsets (get_subsets()).
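
The effect of a cut can be illustrated with the standard library alone: every column is filtered by the same boolean selection. This is a sketch of the concept using a plain dict of columns, not a description of how subset_data() is implemented.

```python
from itertools import compress

columns = {
    'x': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
}
selection = [True, False, True, False, True]  # same selection as 'good_guy'

# Keep only the rows where the selection is True, column by column
cut = {name: list(compress(col, selection)) for name, col in columns.items()}

print(cut['x'])
print(cut['name'])
```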