Defining Subsets and Subset Groups#

This section covers how to define subsets and groups of subsets in PyTTOP.

Defining subsets#

Subsets are defined with the Subset class, which can be imported from pyttop.table:

from pyttop.table import Data, Subset

d = Data(name='example')
d['x'] = [1, 2, 3, 4, 5]
d['object type'] = ['meal', 'meal', 'tea', 'coffee', 'tea']

From expressions#

The most straightforward way to define a subset is by using the add_subsets() method and specifying an expression:

sub1 = d.add_subsets(Subset('x > 3'))
print(sub1)
<Subset 'x > 3' of Data 'example' (2/5)>

In the above example, we have defined a subset named 'x > 3' in our Data. A subset can be retrieved using get_subsets():

d.get_subsets('x > 3')
<Subset 'x > 3' of Data 'example' (2/5)>

Tip

In principle, you may instantiate a subset outside add_subsets(), e.g. sub1 = Subset('x > 3'). However, this does not evaluate the provided expression (or function, as in the next section), and is not considered a subset of any Data. You will have to add it to Data with d.add_subsets(sub1) to make it a real subset of your Data.
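To make the (2/5) in the subset representation above concrete: an expression such as 'x > 3' selects rows through a boolean mask, and the two numbers are the count of selected rows and the table length. The sketch below reproduces this with plain NumPy rather than PyTTOP; the variable names are illustrative, and how PyTTOP evaluates the expression internally is not shown here.

```python
import numpy as np

# Same values as the 'x' column of the example Data
x = np.array([1, 2, 3, 4, 5])

# Row-wise boolean mask equivalent to the expression 'x > 3'
mask = x > 3

n_selected = int(mask.sum())  # number of rows in the subset
n_total = len(x)              # number of rows in the table
print(f"({n_selected}/{n_total})")
```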

Multiple subsets can be added simultaneously. For example:

sub1, sub2 = d.add_subsets(
    Subset('x > 3'),
    Subset('$(object type) == "tea"')
    )

Tip

You can use any expression that can be evaluated by Data.eval(); see its documentation for details.

From functions#

A subset can also be defined given a function. For example:

sub1 = d.add_subsets(
    Subset(lambda t: t['x'] > 3)
    )

The above example defines a subset where d.t['x'] > 3 using a lambda function. For more complex situations, you can define a function:

def select_func(t): # astropy.table.Table -> boolean array
    select = t['x'] > 3
    return select # this should be boolean

sub1 = d.add_subsets(
    Subset(select_func)
    )

Technically, the function should take an astropy.table.Table as input and return a boolean array (i.e., an array whose elements are either True or False).
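As a sketch of a more involved selection function, the example below combines two conditions. To stay runnable without PyTTOP or astropy, a plain dict of NumPy arrays stands in for the table; this stand-in and the function name select_tea_above_3 are illustrative assumptions. The function only uses t[column] lookups, which an astropy.table.Table also supports.

```python
import numpy as np

def select_tea_above_3(t):
    """Return a boolean array: rows that are 'tea' and have x > 3."""
    is_tea = t['object type'] == 'tea'
    x_above_3 = t['x'] > 3
    return is_tea & x_above_3  # element-wise AND keeps the result boolean

# Stand-in for the table (assumption: only column lookup is needed)
t = {
    'x': np.array([1, 2, 3, 4, 5]),
    'object type': np.array(['meal', 'meal', 'tea', 'coffee', 'tea']),
}

select = select_tea_above_3(t)
print(select.dtype, select.tolist())
```

With PyTTOP, a function of this shape would be passed to Subset() and added with add_subsets(), as in the examples above.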

From array-like objects#

Finally, you can directly provide a boolean array, or simply a sequence of True or False values. For example:

sub1, sub1_1, sub1_2 = d.add_subsets(
    Subset(d['x'] > 3),
    Subset([False, False, False, True, True]),
    Subset([0, 0, 0, 1, 1]), # the values will be converted to boolean values
    )
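The three definitions above are meant to select the same rows. The NumPy sketch below checks that equivalence. It assumes that the conversion to booleans behaves like NumPy's own bool cast (0 becomes False, non-zero becomes True), which the text above does not spell out.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

from_comparison = x > 3
from_bools = np.asarray([False, False, False, True, True], dtype=bool)
from_ints = np.asarray([0, 0, 0, 1, 1], dtype=bool)  # 0 -> False, non-zero -> True

# All three masks select exactly the same rows
same = (np.array_equal(from_comparison, from_bools)
        and np.array_equal(from_bools, from_ints))
print(same)
```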

Name, expression, and label#

Subsets have several properties, including name, expression, and label.

The name is used to access a subset:

sub1 = d.get_subsets('x > 3') # gets a subset named 'x > 3'

The character ‘/’ (slash) is not allowed in a subset name.

The expression provides a hint of how the subset has been generated. If a subset has been defined with an expression, the default expression property is simply that expression:

sub1.expression
'x > 3'

The label is the text used to indicate the subset in your plots. When you make a plot for data in a certain subset, the subset label can be automatically added to legends or plot titles (see the documentation for plot making and the tutorials for examples).

The properties can be manually set when defining a subset:

d.add_subsets(
    Subset('x >= 3', name='xgeq3', expression='x >= 3', label='$x \\geq 3$'))

Convenience methods#

Subset also offers convenience methods for common selections, as illustrated by the examples below:

import numpy as np

d.add_subsets(
    Subset.by_range(col1=[0, 1], col2=[0, np.inf]), # defines a subset requiring `(0 < col1 < 1) & (col2 > 0)`
    Subset.by_value('col1', 'value'), # defines a subset requiring the value of the column named 'col1' to equal 'value'
    )
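For reference, the selections described in the comments above correspond to boolean masks like the following. This is a plain NumPy sketch with made-up column values (col1, col2, and kind are hypothetical), not PyTTOP code; the strict inequalities follow the comment in the example.

```python
import numpy as np

# Hypothetical column values, for illustration only
col1 = np.array([-0.5, 0.2, 0.8, 1.5])
col2 = np.array([1.0, -1.0, 3.0, 2.0])
kind = np.array(['a', 'b', 'a', 'c'])

# Range selection: (0 < col1 < 1) & (0 < col2 < inf)
in_range = (col1 > 0) & (col1 < 1) & (col2 > 0) & (col2 < np.inf)

# Value selection: rows where a column equals a given value
has_value = kind == 'a'

print(in_range.tolist(), has_value.tolist())
```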

Note

When a subset is created from a masked array (e.g., created by evaluating columns with masked values), the masked values are NEVER considered elements of the subset. See this page for discussions, demonstrations and caveats.
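The effect can be sketched with NumPy masked arrays. This is an illustration, not PyTTOP's implementation: it assumes the rule amounts to treating a masked comparison result as False, in line with the statement above that masked values are never elements of the subset.

```python
import numpy as np

# 'x' column in which the fourth value is masked (e.g. missing)
x = np.ma.masked_array([1, 2, 3, 4, 5],
                       mask=[False, False, False, True, False])

# The comparison propagates the mask to the fourth element
selection = x > 3

# Treat masked entries as "not selected"
in_subset = selection.filled(False)

# The complementary condition excludes the masked row as well
in_complement = (x <= 3).filled(False)

print(in_subset.tolist(), in_complement.tolist())
```

One consequence under this assumption: a subset and its logical complement need not add up to the full table, because the masked row belongs to neither.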

Subset groups#

By default, a subset is added to the ‘default’ subset group when calling add_subsets(). Alternatively, a subset can be added to a specified subset group. For example:

d.add_subsets(
    Subset('x > 3'),
    Subset('$(object type) == "tea"'),
    group='mygroup',
    )

The above example adds two subsets to a group named ‘mygroup’ (creating it if it does not already exist).

Convenience methods for creating groups of subsets#

A subset group can be particularly useful when several subsets are related. PyTTOP offers convenient methods for quickly creating groups of subsets. For example, the following code defines a subset group named ‘x’, which includes 2 subsets, 0 < x < 3 and 3 < x < 5:

d.subset_group_from_ranges(column='x', ranges=[[0, 3], [3, 5]])
[<Subset 'x(0-3)' of Data 'example' (2/5)>,
 <Subset 'x(3-5)' of Data 'example' (1/5)>]

This method can be used to divide a column of continuous values into several bins.
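The sizes in the output above follow from the open intervals: x = 3 and x = 5 sit exactly on a boundary and therefore fall into neither subset. A NumPy sketch of the same binning is given below; the naming pattern is copied from the output above, and the code is illustrative rather than PyTTOP's implementation.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
ranges = [[0, 3], [3, 5]]

sizes = {}
for lo, hi in ranges:
    mask = (x > lo) & (x < hi)  # strict inequalities: lo < x < hi
    sizes[f"x({lo}-{hi})"] = int(mask.sum())

print(sizes)
```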

A subset group can also be generated for each unique value in a column. This is typically useful when a column represents a classification. For example, the following code creates a group named ‘object type’ with several subsets, each representing a specific value of ‘object type’.

d.subset_group_from_values('object type')
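Conceptually, this amounts to one boolean selection per unique value. A pure-Python sketch follows; the subset names mirror those shown in the summary table later in this section, and the code is an illustration rather than PyTTOP's implementation.

```python
object_type = ['meal', 'meal', 'tea', 'coffee', 'tea']

group = {}
for value in sorted(set(object_type)):
    # One boolean selection per unique value
    mask = [item == value for item in object_type]
    group[f"object type={value}"] = mask

sizes = {name: sum(mask) for name, mask in group.items()}
print(sizes)
```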

Managing subsets and groups#

Summary of subsets#

To get a summary of all subsets defined for a Data, call the subset_summary() method (or the abbreviation, subsum()):

d.subset_summary()
Table length=10
group        name                     size   fraction  expression                                  label
str11        str23                    int64  float64   str42                                       str23
-----------  -----------------------  -----  --------  ------------------------------------------  -----------------------
$unmasked    -                        -1     nan       <special subsets: item in col unmasked>     -
$eval        -                        -1     nan       <special subsets: rows satisfy expression>  -
default      all                      5      1.0       all                                         All
default      x > 3                    2      0.4       x > 3                                       x > 3
default      $(object type) == "tea"  2      0.4       $(object type) == "tea"                     $(object type) == "tea"
x            x(0-3)                   2      0.4       (x > 0) & (x < 3)                           x$\in$(0, 3)
x            x(3-5)                   1      0.2       (x > 3) & (x < 5)                           x$\in$(3, 5)
object type  object type=coffee       1      0.2       object type=coffee                          coffee
object type  object type=meal         2      0.4       object type=meal                            meal
object type  object type=tea          2      0.4       object type=tea                             tea

As can be seen, this returns a table with each row representing a subset. For reference, the columns of the table are summarized below:

  • group: The name of the subset group.

  • name: The name of the subset.

  • size: The size of this subset, i.e. number of rows selected.

  • fraction: The fraction of this subset relative to the whole Data table, i.e. size/len(<data>).

  • expression: A hint of the expression of the selection that specifies this subset.

  • label: The string that is used for this subset in plots (e.g. labels in legends and titles).
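The size and fraction columns can be reproduced by hand. The sketch below recomputes them from boolean masks for two of the subsets listed above; it is plain Python and independent of PyTTOP.

```python
x = [1, 2, 3, 4, 5]

masks = {
    'all': [True] * len(x),
    'x > 3': [v > 3 for v in x],
}

summary = {}
for name, mask in masks.items():
    size = sum(mask)             # number of rows selected
    fraction = size / len(mask)  # size / len(<data>)
    summary[name] = (size, fraction)

print(summary)
```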

Note that subset groups starting with $ indicate the virtual special subsets. See documentation on special subsets for details.

Clearing subsets#

Subsets can be cleared with clear_subsets(), either for a single group or for all groups:

d.clear_subsets('x') # clears subset group named 'x'
d.clear_subsets() # clears ALL subsets

Even after clearing, one subset always remains: the subset named ‘all’ in the ‘default’ group, which includes all rows:

d.subsum()
Table length=3
group      name  size   fraction  expression                                  label
str9       str3  int64  float64   str42                                       str3
---------  ----  -----  --------  ------------------------------------------  -----
$unmasked  -     -1     nan       <special subsets: item in col unmasked>     -
$eval      -     -1     nan       <special subsets: rows satisfy expression>  -
default    all   5      1.0       all                                         All