Using Subsets#
Retrieving subsets#
Any subset defined for a Data can be retrieved, or referred to (when calling methods that use subsets, e.g. data.subset_data(), data.plots()), by its name and, if necessary, the name of the group it belongs to.
Use the get_subsets() method to retrieve a subset by its “path” (i.e., the combination of a group name and a subset name). For example, to get the subset named “all” in the “default” group, use:
s = data.get_subsets('default/all') # path is <group_name>/<subset_name>
For subsets in the “default” group, default/ can be omitted:
s = data.get_subsets('all') # retrieving from the "default" group
You can also provide group and subset names separately. For example:
data.get_subsets(name='all', group='default') # subset 'all' in group 'default'
data.get_subsets(name='my_subset') # looks in group 'default' first; if 'my_subset' is not found there, the other groups are searched
If a subset object is passed instead of a name or path, that subset itself is returned:
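The lookup rules above can be sketched in plain Python. This is only an illustration, not PyTTOP's implementation: `split_path`, `find_subset`, and the `groups` dictionary are hypothetical names, special subsets are ignored, and the behavior when an explicitly given group does not contain the subset (raising `KeyError` here) is an assumption.

```python
# Hypothetical sketch of the lookup rules; not PyTTOP's actual code.
DEFAULT_GROUP = "default"

def split_path(path):
    """Split '<group_name>/<subset_name>' into (group, name)."""
    group, sep, name = path.rpartition("/")
    # Without a '/', the whole string is the subset name in the default group.
    return (group if sep else DEFAULT_GROUP), name

def find_subset(groups, name, group=None):
    """Look up `name`, trying `group` (or the default group) first."""
    first = group if group is not None else DEFAULT_GROUP
    if name in groups.get(first, {}):
        return groups[first][name]
    if group is None:
        # No group given: fall back to searching the remaining groups.
        for other, subsets in groups.items():
            if other != first and name in subsets:
                return subsets[name]
    # Assumption: an explicit group is not followed by a fallback search.
    raise KeyError(f"subset {name!r} not found")

# Toy registry: {group_name: {subset_name: subset_object}}
groups = {
    "default": {"all": "<all>"},
    "mygroup": {"mysubset": "<mysubset>"},
}

print(split_path("default/all"))        # ('default', 'all')
print(split_path("all"))                # ('default', 'all')
print(find_subset(groups, "mysubset"))  # <mysubset>
```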
data.get_subsets(my_subset) # returns `my_subset` itself
Multiple subsets can be retrieved simultaneously by providing a list (or lists):
data.get_subsets(['default/all', 'mygroup/mysubset'])
data.get_subsets(name=['all', 'mysubset'])
data.get_subsets(name=['all', 'mysubset'], group='default')
Multiple subsets are returned as a list.
Special subsets#
A “special subset” is a virtual subset that is created temporarily when it is retrieved (or referred to). It is not considered a real subset of a Data (and thus does not appear in data.subset_summary()) unless it is explicitly added to the Data using data.add_subsets().
A special subset can only be retrieved with its “path” (see retrieving subsets). For example:
data.get_subsets('<special subset path>') # allowed
data.get_subsets(path='<special subset path>') # allowed
data.get_subsets(name='<special subset path>') # ERROR
Tip
“Retrieving” a special subset actually creates a new subset, which can then be used like any existing subset of the Data. This is useful when making plots: you can specify a special subset (e.g., using the paths argument of data.plots()) without defining it in advance, and it will be created automatically (and temporarily).
$unmasked#
$unmasked:<col_name> is a special subset that selects the rows whose value in the column named <col_name> is not masked. For example:
d = Data(name='example')
d['x'] = [1, 2, -99, 4, -99]
d.mask_missing('x', -99) # values -99 in column 'x' are masked
print(d.t)
s = d.get_subsets('$unmasked:x')
print(s)
print(s.selection) # s.selection is a boolean array indicating whether each row is included in this subset
[mask missing] col 'x': 2/5 (40.00%) masked (value: -99).
x
---
1
2
--
4
--
<Subset '$unmasked(x)' of Data 'example' (3/5)>
[ True True False True False]
As seen, the subset '$unmasked:x' includes the three items in column 'x' that are not masked.
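The relationship between the mask and the selection can be illustrated without PyTTOP. The sketch below uses plain Python lists; the variable names are illustrative only, and it assumes that masking simply flags the values equal to -99, as in the example above.

```python
# Plain-Python illustration of what '$unmasked:x' selects.
values = [1, 2, -99, 4, -99]
missing_value = -99

# True where the value is treated as missing (i.e., masked).
mask = [v == missing_value for v in values]

# The "unmasked" selection is the element-wise negation of the mask.
selection = [not m for m in mask]

print(selection)                               # [True, True, False, True, False]
print(sum(selection), "of", len(selection))    # 3 of 5
```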
Note
In PyTTOP 0.4.3 and older, the path had the form $unmasked/<col_name>. This form is still supported for compatibility, but may be deprecated in a future release.
$eval#
$eval:<expr> is a special subset that evaluates the expression <expr>, similar to using a subset defined as Subset(<expr>) (see details here). For example:
d['x'] = [1, 2, 3, 4, 5]
s = d.get_subsets('$eval:x > 3')
print(s)
print(s.selection) # s.selection is a boolean array indicating whether each row is included in this subset
<Subset 'x > 3' of Data 'example' (2/5)>
[False False False True True]
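Conceptually, the selection is the result of evaluating the expression for every row. The following sketch reproduces the boolean array above in plain Python; it is not how PyTTOP evaluates expressions (which is not shown here), and the row-by-row use of `eval` is an assumption made purely for illustration.

```python
# Illustration only: evaluate an expression once per row.
x = [1, 2, 3, 4, 5]
expr = "x > 3"

# Each row exposes its own value of 'x' to the expression; builtins are disabled.
selection = [bool(eval(expr, {"__builtins__": {}}, {"x": value})) for value in x]

print(selection)  # [False, False, False, True, True]
```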
Subset calculations#
For subsets of the same Data, the intersection, union, and complement can be computed using the operators &, |, and ~, respectively.
# sub1, sub2 are two subsets
sub1 & sub2 # intersection: rows in both `sub1` AND `sub2`
sub1 | sub2 # union: rows in either `sub1` OR `sub2`
~sub1 # complement: rows NOT in `sub1`
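In terms of the boolean selections, these operators act row by row. The sketch below shows the semantics on two made-up selections using plain Python lists; it does not use PyTTOP's Subset class, and the names `sel1` and `sel2` are illustrative.

```python
# Row-wise semantics of &, |, and ~ on boolean selections.
sel1 = [True, True, False, False, True]
sel2 = [True, False, True, False, True]

intersection = [a and b for a, b in zip(sel1, sel2)]  # like sub1 & sub2
union = [a or b for a, b in zip(sel1, sel2)]          # like sub1 | sub2
complement = [not a for a in sel1]                    # like ~sub1

print(intersection)  # [True, False, False, False, True]
print(union)         # [True, True, True, False, True]
print(complement)    # [False, False, True, True, False]

# De Morgan's law: NOT (A AND B) equals (NOT A) OR (NOT B).
lhs = [not v for v in intersection]
rhs = [(not a) or (not b) for a, b in zip(sel1, sel2)]
print(lhs == rhs)    # True
```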
Cutting data given subsets#
Use the subset_data() method (or its alias, subdat()) to obtain a cut of the data that contains only the rows in the specified subset. For example:
d['name'] = ['Alice', 'Bob', 'Carol', 'Dave', 'Eve']
d.add_subsets(Subset([True, False, True, False, True], name='good_guy'))
cut_d = d.subset_data('good_guy')
cut_d.t
| x | name |
|---|---|
| int64 | str5 |
| 1 | Alice |
| 3 | Carol |
| 5 | Eve |
The rules for referring to subsets in subset_data() are identical to those for retrieving subsets with get_subsets().
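The effect of cutting can be reproduced on plain Python data as a sanity check. The sketch below is not PyTTOP code: it stores the table as a hypothetical dictionary of columns and applies the same boolean selection to each column with `itertools.compress`.

```python
from itertools import compress

# Toy table stored as {column_name: values}; mirrors the example above.
columns = {
    "x": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
}
selection = [True, False, True, False, True]  # the 'good_guy' subset

# Keep only the rows where the selection is True, column by column.
cut = {col: list(compress(vals, selection)) for col, vals in columns.items()}

print(cut)  # {'x': [1, 3, 5], 'name': ['Alice', 'Carol', 'Eve']}
```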