A quickstart example#

This tutorial demonstrates the basic steps and workflows for using PyTTOP in your project. It serves as a solid starting point and reference guide to help you understand its usage. For more detailed information, you can explore the documentation on specific topics.

Loading Data#

The main class used to store tabular data in PyTTOP is Data. Let us first import it:

from pyttop.table import Data

PyTTOP includes example datasets for testing and exploring its features. Let us start by loading an example table called ‘LGM.bio’, which contains the biological properties of the fictional “little green men” (LGM).

from pyttop import get_example
lgm_bio = get_example('LGM.bio')
lgm_bio
<Data 'LGM.bio'>

Note

The “LGM” datasets are randomly generated solely for testing and demonstration purposes, with no real meaning. Please do not interpret them seriously.

As seen, the resulting lgm_bio is a Data object named ‘LGM.bio’. In most cases, you may want to load data from a file, which can be done with:

my_data = Data('path/to/data/file', name='my_data')

See the documentation here for details.

Now, we can take a look at the table we have:

lgm_bio.t
Table length=5000
id     sex      age    height              weight
int64  str7     int64  float64             float64
4982   Both     152    5.285876685178171   47.21641862351436
3638   Both     -99    4.871718461858681   49.0944637256166
3857   Female   121    5.978068021500273   59.4423357404272
2499   Female   81     3.872845897766179   36.21057288026628
1938   Female   57     4.1134717034846275  41.66593964783728
1298   Male     114    5.816001078353533   57.820407142885564
...    ...      ...    ...                 ...
4176   Both     224    5.83569589806094    47.2038049736114
773    Male     79     5.13437852352276    52.56579589030932
831    Neither  51     5.469546410486544   59.79200427757866
4569   N/A      12     4.678570862408104   53.360488208662126
1000   Neither  136    4.603132905272983   40.11667243713672
4158   Female   186    7.371046528568509   71.33472061653602

Data preprocessing#

It appears that the table contains some missing values, such as -99 in the ‘age’ column. To exclude these values from calculations, plots, and other analyses, we can mask them using the mask_missing() method (see here for details):

lgm_bio.mask_missing(missval=-99)
[mask missing] col 'id': 0/5000 (0.00%) masked (value: -99).
[mask missing] col 'sex': 0/5000 (0.00%) masked (value: -99).
[mask missing] col 'age': 625/5000 (12.50%) masked (value: -99).
[mask missing] col 'height': 0/5000 (0.00%) masked (value: -99).
[mask missing] col 'weight': 0/5000 (0.00%) masked (value: -99).

The output shows that 625 out of 5000 values in the ‘age’ column are missing, while the other columns do not contain -99. However, the 'N/A' entries in the ‘sex’ column also appear to represent missing values. We can mask 'N/A' specifically in the ‘sex’ column as follows:

lgm_bio.mask_missing(['sex'], missval='N/A')
[mask missing] col 'sex': 714/5000 (14.28%) masked (value: N/A).
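To make the idea of masking concrete, here is a minimal plain-Python sketch of what marking a sentinel value as missing amounts to. It is an illustration only, not PyTTOP's implementation; the helper `mask_missing` below and the short `ages` list are hypothetical.

```python
def mask_missing(values, missval):
    """Return a boolean mask (True = missing) and a short summary string."""
    mask = [v == missval for v in values]
    n_masked = sum(mask)
    summary = f"{n_masked}/{len(values)} ({n_masked / len(values):.2%}) masked"
    return mask, summary


# Hypothetical ages; -99 is the sentinel for "missing".
ages = [152, -99, 121, 81, 57, 114, -99, 79]
mask, summary = mask_missing(ages, missval=-99)
print(summary)

# Masked entries are excluded from calculations such as the mean.
valid_ages = [a for a, m in zip(ages, mask) if not m]
mean_age = sum(valid_ages) / len(valid_ages)
print(mean_age)
```

The percentages in the log above follow the same arithmetic: 625/5000 is 12.50% and 714/5000 is 14.28%.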

Now let us check how the above steps take effect:

lgm_bio.t
Table masked=True length=5000
id     sex      age    height              weight
int64  str7     int64  float64             float64
4982   Both     152    5.285876685178171   47.21641862351436
3638   Both     --     4.871718461858681   49.0944637256166
3857   Female   121    5.978068021500273   59.4423357404272
2499   Female   81     3.872845897766179   36.21057288026628
1938   Female   57     4.1134717034846275  41.66593964783728
1298   Male     114    5.816001078353533   57.820407142885564
...    ...      ...    ...                 ...
4176   Both     224    5.83569589806094    47.2038049736114
773    Male     79     5.13437852352276    52.56579589030932
831    Neither  51     5.469546410486544   59.79200427757866
4569   --       12     4.678570862408104   53.360488208662126
1000   Neither  136    4.603132905272983   40.11667243713672
4158   Female   186    7.371046528568509   71.33472061653602

We can perform more operations on the tables, as introduced here.

Warning

Preprocessing steps such as masking should be done before any subset is defined (see the Defining subsets section), because subsets are static (see the discussion here).

Matching and merging data#

It is common for the information of interest to be recorded in several separate tables, such as when different properties are obtained from different surveys. Let us load more example tables about LGM:

lgm_addr = get_example('LGM.addr')
lgm_name = get_example('LGM.name')
lgm_house = get_example('LGM.house')
lgm_addr, lgm_name, lgm_house
(<Data 'LGM.addr'>, <Data 'LGM.name'>, <Data 'LGM.house'>)

'LGM.addr' contains the addresses, described by longitude called 'ra' and latitude called 'dec':

lgm_addr.t
Table length=5000
id ra dec
int64 float64 float64
4982136.085438641085686.1501075643242
363890.4548782543277526.68961090066547
3857327.2846655844139-35.76782461457626
2499357.166136282663159.75263306402611
1938258.093556536886443.62751110462609
1298288.1021019923507-18.0319889117823
.........
4176351.40564049853225.242588496591069
773172.39697961657959-13.027380226082531
83144.46225264213912-64.93410115013465
4569140.7458210141901411.24210612306192
1000101.248548074726873.9419081639703397
4158129.30640171581323.739275822120526

'LGM.name' contains the names of LGM:

lgm_name.t
Table length=3500
id     surname
int64  str10
2885   Patterson
1885   Sanchez
2814   Parker
3571   Bryant
1537   Jackson
2006   Wright
...    ...
1353   Stewart
835    Washington
4117   Knight
3638   Parker
1168   Perkins
29     Scott

and 'LGM.house' contains information about the houses at each location:

lgm_house.t
Table length=5000
ra dec area
float64 float64 float64
136.085438641085686.150107564324230.12559414147148
90.4548782543277526.6896109006654722.588093897628738
327.2846655844139-35.7678246145762624.700022953216845
357.166136282663159.7526330640261126.007538891355942
258.093556536886443.6275111046260924.77640003628158
288.1021019923507-18.031988911782321.509310818316457
.........
351.40564049853225.24258849659106919.203126237682365
172.39697961657959-13.02738022608253122.044461226220715
44.46225264213912-64.9341011501346526.968278557675564
140.7458210141901411.2421061230619223.402062063352645
101.248548074726873.941908163970339718.977101898408353
129.30640171581323.73927582212052622.908667591094602

To analyze our data, we may need to match rows that represent the same instance and merge all the tables mentioned above (lgm_bio, lgm_addr, lgm_name, lgm_house). To match tables, we first need to import the appropriate “matchers”:

from pyttop.matcher import ExactMatcher, SkyMatcher

Note that lgm_bio, lgm_addr, and lgm_name all contain the “id” column, so we can use the ExactMatcher to match rows where the “id” values are exactly the same.

lgm_bio.match(lgm_name, ExactMatcher('id', 'id')) # `'id', 'id'` are column names for `lgm_bio` and `lgm_name`, respectively.
lgm_bio.match(lgm_addr, ExactMatcher('id')); # if the two column names are the same
[match] "LGM.name" matched to "LGM.bio": 3500/5000 matched.
[match] "LGM.addr" matched to "LGM.bio": 5000/5000 matched.

As shown in the output, all 5000 rows in "LGM.bio" have matching rows in "LGM.addr", but only 3500 of them have matches in "LGM.name". In other words, 1500 “id” values in "LGM.bio" do not have corresponding entries in "LGM.name".
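Conceptually, exact matching is a key lookup: for each row of the base table, find the row of the other table that carries the same key. The sketch below illustrates this with a dictionary; it is not how ExactMatcher is implemented, and the function `exact_match` is a hypothetical helper.

```python
def exact_match(base_keys, other_keys):
    """For each key in base_keys, return the row index of the same key
    in other_keys, or None if the key is absent there."""
    index_of = {}
    for i, key in enumerate(other_keys):
        index_of.setdefault(key, i)  # keep the first occurrence of each key
    return [index_of.get(key) for key in base_keys]


bio_ids = [4982, 3638, 3857, 2499]   # base table keys
name_ids = [2499, 4982, 3638]        # 3857 has no entry here
idx = exact_match(bio_ids, name_ids)
n_matched = sum(i is not None for i in idx)
print(f"{n_matched}/{len(bio_ids)} matched")
```

Rows whose lookup returns None are the ones reported as unmatched in the log.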

To merge the "LGM.house" table, we can match it with "LGM.addr" based on the longitude and latitude coordinates (ra, dec), using SkyMatcher, since the "LGM.house" table does not contain the “id” column.

lgm_addr.match(lgm_house, SkyMatcher());
[SkyMatcher] Data LGM.addr: found RA name 'ra' and Dec name 'dec'.
[SkyMatcher] Data LGM.house: found RA name 'ra' and Dec name 'dec'.
[match] "LGM.house" matched to "LGM.addr": 5000/5000 matched.

In this case, the ‘ra’ and ‘dec’ columns are automatically detected. Since (ra, dec) may contain errors, entries with the closest positions and distances smaller than a threshold (defaulting to 1 arcsec) are matched. We can also explicitly specify the RA and Dec column names and the threshold thres:

lgm_house.match(lgm_addr, SkyMatcher(thres=1, coord='ra-dec', coord1='ra-dec'));
[match] "LGM.addr" matched to "LGM.house": 5000/5000 matched.
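The matching rule described above (closest position, separation below a threshold) can be sketched in plain Python as follows. This is a brute-force illustration under the assumption that all coordinates are in degrees; it is not SkyMatcher's actual algorithm, and `angular_sep_arcsec` and `sky_match` are hypothetical helpers.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcsec between two positions given
    in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600


def sky_match(base, other, thres=1.0):
    """For each (ra, dec) in base, return the index of the closest entry
    in other if it lies within thres arcsec, else None."""
    result = []
    for ra, dec in base:
        seps = [angular_sep_arcsec(ra, dec, ra2, dec2) for ra2, dec2 in other]
        j = min(range(len(seps)), key=seps.__getitem__)
        result.append(j if seps[j] < thres else None)
    return result


# Hypothetical positions: one house 0.5 arcsec from an address,
# another about 1.9 arcsec away (2 arcsec in RA at dec = 20 deg).
addr = [(10.0, 20.0), (150.0, -30.0), (200.0, 45.0)]
house = [(150.0, -30.0 + 0.5 / 3600), (10.0 + 2.0 / 3600, 20.0)]
matches = sky_match(addr, house)
print(matches)
```

Only the second address has a house within 1 arcsec, so it is the only one matched.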

Caution

In astronomy, RA can be expressed in hours. If the RA column does not have units (which can be checked with, e.g., data.t['RA'].unit), SkyMatcher assumes the values are in degrees. You may need to specify the units as described here. Failing to do so can lead to incorrect matching results, which may not be immediately apparent.
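To see why the unit matters, recall that 24 hours of RA correspond to 360 degrees, so 1 hour is 15 degrees. The short sketch below (with a hypothetical helper `ra_hours_to_degrees`) shows how large the error becomes if a value in hours is read as degrees.

```python
def ra_hours_to_degrees(ra_hours):
    """Convert right ascension from hours to degrees (1 h = 15 deg)."""
    return ra_hours * 15.0


ra_hours = 6.0
ra_deg = ra_hours_to_degrees(ra_hours)

# If 6.0 (hours) were silently treated as 6.0 degrees, the position
# would be off by this many degrees:
error_if_misread = ra_deg - ra_hours
print(ra_deg, error_if_misread)
```

An offset of this size is far beyond any arcsecond-level threshold, so positions interpreted in the wrong unit would not be matched correctly.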

Tip

With data.match(data1, <...>), we try to find the single best-matching row (if any) in data1 for each row in data. This implies that lgm_addr.match(lgm_house, <...>) and lgm_house.match(lgm_addr, <...>) can differ when, e.g., some instances are included in one table but not in the other. For more details, see the matching documentation.

For simplicity, we will refer to data.match(data1, <...>) as “data1 matched to data”.
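The asymmetry between the two directions can be illustrated with a toy example in which one table is a subset of the other. The helper `match_counts` and the id lists are hypothetical; the sketch only counts how many base rows find a partner.

```python
def match_counts(base_ids, other_ids):
    """Return (number of base rows with a partner in other, number of base rows)."""
    other = set(other_ids)
    n_matched = sum(i in other for i in base_ids)
    return n_matched, len(base_ids)


bio_ids = [1, 2, 3, 4, 5]
name_ids = [2, 4, 5]

# "name matched to bio": every bio row looks for a partner in name.
name_to_bio = match_counts(bio_ids, name_ids)
# "bio matched to name": every name row looks for a partner in bio.
bio_to_name = match_counts(name_ids, bio_ids)
print(name_to_bio, bio_to_name)
```

In the first direction two of the five base rows stay unmatched, while in the second every base row finds a partner.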

We can check how tables are matched to "LGM.bio", either directly or through intermediate tables, using the match_tree() method:

lgm_bio.match_tree()
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
LGM.bio [base]
:   LGM.name [ExactMatcher("id", "id")]
:   LGM.addr [ExactMatcher("id", "id")]
:   :   LGM.house [<SkyMatcher with thres=1>]
:   :   :   (LGM.addr) [<SkyMatcher with thres=1>]
---------------

As shown, two datasets, "LGM.name" and "LGM.addr", are matched directly to our base data, "LGM.bio", using the ExactMatcher. "LGM.house" is matched to "LGM.addr" and can thus be merged as well. Although "LGM.addr" is also matched to "LGM.house", since we already have "LGM.addr", we do not need to count it again.
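The structure of this tree can be reproduced with a small depth-first walk over a "who is matched to whom" mapping, in which tables that were already visited are shown in parentheses and not expanded again. This is a sketch of the idea only; the function `walk_match_tree` and the `matched_to` dictionary are hypothetical and omit the matcher labels.

```python
def walk_match_tree(base, matched_to, depth=0, seen=None, lines=None):
    """Depth-first walk over matched tables. Tables already visited are
    shown in parentheses and are not expanded again."""
    seen = {base} if seen is None else seen
    lines = [base + " [base]"] if lines is None else lines
    for child in matched_to.get(base, []):
        prefix = ":   " * (depth + 1)
        if child in seen:
            lines.append(f"{prefix}({child})")
        else:
            seen.add(child)
            lines.append(f"{prefix}{child}")
            walk_match_tree(child, matched_to, depth + 1, seen, lines)
    return lines


# Keys are base tables; values are the tables matched to them.
matched_to = {
    "LGM.bio": ["LGM.name", "LGM.addr"],
    "LGM.addr": ["LGM.house"],
    "LGM.house": ["LGM.addr"],
}
lines = walk_match_tree("LGM.bio", matched_to)
print("\n".join(lines))
```

Tracking visited tables is what keeps the circular match between "LGM.addr" and "LGM.house" from being followed indefinitely.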

We can now merge all these tables using the merge() method:

lgm = lgm_bio.merge() # `lgm` is the merged dataset
lgm.t
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
LGM.bio
:   LGM.name
:   LGM.addr
:   :   LGM.house
:   :   :   (LGM.addr)
---------------
[merge] merged: LGM.bio, LGM.name, LGM.addr, LGM.house
Table length=3500
id_LGM.bio sex age height weight id_LGM.name surname id_LGM.addr ra_LGM.addr dec_LGM.addr ra_LGM.house dec_LGM.house area
int64 str7 int64 float64 float64 int64 str10 int64 float64 float64 float64 float64 float64
4982Both1525.28587668517817147.216418623514364982Johnston4982136.085438641085686.1501075643242136.085438641085686.150107564324230.12559414147148
3638Both--4.87171846185868149.09446372561663638Parker363890.4548782543277526.6896109006654790.4548782543277526.6896109006654722.588093897628738
2499Female813.87284589776617936.210572880266282499Hughes2499357.166136282663159.75263306402611357.166136282663159.7526330640261126.007538891355942
1938Female574.113471703484627541.665939647837281938Richardson1938258.093556536886443.62751110462609258.093556536886443.6275111046260924.77640003628158
1298Male1145.81600107835353357.8204071428855641298Adams1298288.1021019923507-18.0319889117823288.1021019923507-18.031988911782321.509310818316457
2788--1604.94009383789992642.1315095719148052788Simmons2788184.616222832333928.355354989748907184.616222832333928.35535498974890722.782082185683507
.......................................
4366Neither--5.74000448014195666.017877505686234366Alvarez43662.84312279507514867.185381723394062.84312279507514867.1853817233940627.44336793418737
773Male795.1343785235227652.56579589030932773Knight773172.39697961657959-13.027380226082531172.39697961657959-13.02738022608253122.044461226220715
831Neither515.46954641048654459.79200427757866831Cameron83144.46225264213912-64.9341011501346544.46225264213912-64.9341011501346526.968278557675564
4569--124.67857086240810453.3604882086621264569Cameron4569140.7458210141901411.24210612306192140.7458210141901411.2421061230619223.402062063352645
1000Neither1364.60313290527298340.116672437136721000Knight1000101.248548074726873.9419081639703397101.248548074726873.941908163970339718.977101898408353
4158Female1867.37104652856850971.334720616536024158Cooper4158129.30640171581323.739275822120526129.30640171581323.73927582212052622.908667591094602

The resulting table includes only 3500 rows, as only 3500 out of 5000 were matched when matching "LGM.name" to "LGM.bio". However, we may want to retain rows without "LGM.name" information. Also, the resulting table appears redundant as, e.g., it includes the “id” column from three datasets: 'id_LGM.bio', 'id_LGM.name', and 'id_LGM.addr'. We can control which columns should be ignored or merged. To achieve this, we can make some adjustments to the code:

lgm = lgm_bio.merge(
    keep_unmatched=['LGM.name'], # set it to True to keep unmatched rows for all tables
    merge_columns={
        'LGM.name': ['surname'], # only merges this column
    },
    ignore_columns={
        'LGM.addr': ['id'],
        'LGM.house': ['ra', 'dec'], # do not merge these columns
    },
)
lgm.t
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
LGM.bio
:   LGM.name
:   LGM.addr
:   :   LGM.house
:   :   :   (LGM.addr)
---------------
[merge] entries with no match for LGM.name is kept.
[merge] merged: LGM.bio, LGM.name, LGM.addr, LGM.house
Table length=5000
id sex age height weight surname ra dec area
int64 str7 int64 float64 float64 str10 float64 float64 float64
4982Both1525.28587668517817147.21641862351436Johnston136.085438641085686.150107564324230.12559414147148
3638Both--4.87171846185868149.0944637256166Parker90.4548782543277526.6896109006654722.588093897628738
3857Female1215.97806802150027359.4423357404272--327.2846655844139-35.7678246145762624.700022953216845
2499Female813.87284589776617936.21057288026628Hughes357.166136282663159.7526330640261126.007538891355942
1938Female574.113471703484627541.66593964783728Richardson258.093556536886443.6275111046260924.77640003628158
1298Male1145.81600107835353357.820407142885564Adams288.1021019923507-18.031988911782321.509310818316457
...........................
4176Both2245.8356958980609447.2038049736114--351.40564049853225.24258849659106919.203126237682365
773Male795.1343785235227652.56579589030932Knight172.39697961657959-13.02738022608253122.044461226220715
831Neither515.46954641048654459.79200427757866Cameron44.46225264213912-64.9341011501346526.968278557675564
4569--124.67857086240810453.360488208662126Cameron140.7458210141901411.2421061230619223.402062063352645
1000Neither1364.60313290527298340.11667243713672Knight101.248548074726873.941908163970339718.977101898408353
4158Female1867.37104652856850971.33472061653602Cooper129.30640171581323.73927582212052622.908667591094602

As seen, the “surname” entries of rows without a match in "LGM.name" are masked.
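The difference between the two merges is essentially that between dropping and keeping base rows without a partner. The following sketch shows this with plain dictionaries, using None to stand in for a masked value; `merge_rows` is a hypothetical helper and not PyTTOP's merge().

```python
def merge_rows(base_rows, surname_by_id, keep_unmatched):
    """Attach a surname to each base row. Rows without a match are either
    dropped or kept with a missing (None) surname."""
    merged = []
    for row in base_rows:
        surname = surname_by_id.get(row["id"])
        if surname is None and not keep_unmatched:
            continue  # drop rows that have no partner
        merged.append({**row, "surname": surname})
    return merged


bio = [
    {"id": 4982, "sex": "Both"},
    {"id": 3857, "sex": "Female"},  # no entry in the name table
    {"id": 2499, "sex": "Female"},
]
names = {4982: "Johnston", 2499: "Hughes"}

dropped = merge_rows(bio, names, keep_unmatched=False)
kept = merge_rows(bio, names, keep_unmatched=True)
print(len(dropped), len(kept))
```

With unmatched rows kept, the merged table has as many rows as the base table, and the missing surname plays the role of the masked entries shown above.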

Now with our combined dataset, we are ready to analyze it.

Defining subsets#

It is common to be interested in subsets of a dataset. For example, we might want to know how many of the “little green men” have the surname “Smith”. So let us import Subset and add a subset:

from pyttop.table import Subset

smith = lgm.add_subsets(
    Subset.by_value('surname', 'Smith'), # Defines a subset for rows where the 'surname' column is 'Smith'
    )
smith
<Subset 'surname=Smith' of Data '(LGM.bio).MATCH(LGM.name, LGM.addr, LGM.house)' (47/5000)>

As shown in the output, 47 out of 5000 entries have the surname “Smith”.
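One way to think about a subset is as a boolean selection over the rows, with its complement selecting every other row. The sketch below uses plain lists and a hypothetical `surnames` column; how PyTTOP stores subsets internally, and how its complement treats masked entries, is not covered here.

```python
# Hypothetical surname column; None stands for a masked entry.
surnames = ["Smith", "Parker", None, "Smith", "Knight"]

# A subset as one boolean per row, and its complement.
smith = [s == "Smith" for s in surnames]
not_smith = [not m for m in smith]

size, total = sum(smith), len(smith)
print(f"{size}/{total}")

# The fraction reported for the real dataset: 47 of 5000 rows.
fraction_reported = 47 / 5000
print(fraction_reported)
```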

We may also want to study potential systematic differences between sexes. To do so, let us create a “subset group” for the different sexes:

lgm.subset_group_from_values('sex')
lgm.get_subsets(name=['sex=Male', 'sex=Female'])
[subset] Found subset 'sex=Male' in group 'sex'.
[subset] Found subset 'sex=Female' in group 'sex'.
[<Subset 'sex=Male' of Data '(LGM.bio).MATCH(LGM.name, LGM.addr, LGM.house)' (1036/5000)>,
 <Subset 'sex=Female' of Data '(LGM.bio).MATCH(LGM.name, LGM.addr, LGM.house)' (1031/5000)>]

Here are all subsets we have:

lgm.subset_summary()
Table length=8
group      name           size   fraction  expression                                  label
str9       str13          int64  float64   str42                                       str7
$unmasked  -              -1     nan       <special subsets: item in col unmasked>     -
$eval      -              -1     nan       <special subsets: rows satisfy expression>  -
default    all            5000   1.0       all                                         All
default    surname=Smith  47     0.0094    surname=Smith                               Smith
sex        sex=Both       1096   0.2192    sex=Both                                    Both
sex        sex=Female     1031   0.2062    sex=Female                                  Female
sex        sex=Male       1036   0.2072    sex=Male                                    Male
sex        sex=Neither    1123   0.2246    sex=Neither                                 Neither
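A subset group built from the values of a column is, in essence, a count of rows per distinct value. The sketch below illustrates this with `collections.Counter` on a hypothetical column in which None stands for a masked entry, then cross-checks the sizes reported in the summary above. It is not PyTTOP's implementation.

```python
from collections import Counter


def group_sizes(values):
    """Count rows per distinct value, skipping masked entries (None here)."""
    return Counter(v for v in values if v is not None)


# Hypothetical column with two masked entries.
sexes = ["Both", "Female", None, "Male", "Both", "Neither", None, "Female"]
sizes = group_sizes(sexes)
fractions = {name: n / len(sexes) for name, n in sizes.items()}
print(sizes, fractions)

# Cross-check of the sizes reported in the summary table.
reported = {"Both": 1096, "Female": 1031, "Male": 1036, "Neither": 1123}
n_in_groups = sum(reported.values())
n_not_in_any_group = 5000 - n_in_groups
print(n_in_groups, n_not_in_any_group)
```

The four 'sex' subsets cover 4286 rows; the remaining 714 equal the number of 'N/A' entries masked earlier.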

There are more ways to define and use subsets. See the subset documentation for details.

Making plots#

To quickly explore the patterns in our data, it is a good idea to create some plots.

Let us begin by plotting weight versus height for all entries:

lgm.scatter('height', 'weight', c='k', s=.5, subsets='all')
(<Figure size 640x480 with 1 Axes>,
 <Axes: title={'center': 'All'}, xlabel='height', ylabel='weight'>)
[Figure: scatter plot of weight versus height for all entries]

Are there any differences between sexes?

lgm.scatter('height', 'weight', s=.5, 
    group='sex',
    )
(<Figure size 640x480 with 1 Axes>, <Axes: xlabel='height', ylabel='weight'>)
[Figure: scatter plot of weight versus height, grouped by sex]

It seems that height and weight do not depend on sex. Could they depend on age instead? We can color-code the markers by age:

lgm.scatter('height', 'weight', s=.5, c='age', cmap='jet', subsets='all')
(<Figure size 640x480 with 2 Axes>,
 <Axes: title={'center': 'All'}, xlabel='height', ylabel='weight'>)
[Figure: scatter plot of weight versus height, color-coded by age]

So their weight at a given height appears to depend on age. By the way, we can replace lgm.scatter() with lgm.plots() for more general usage (see here for details):

lgm.plots(
    'scatter', # making a scatter plot
    cols=('height', 'weight'), # column 'weight' versus 'height'
    kwcols={'c': 'age'}, # color-coding: column 'age'
    s=.5, cmap='jet', # more scatter settings
    subsets='all', # plotting all rows
    )
(<Figure size 640x480 with 2 Axes>,
 <Axes: title={'center': 'All'}, xlabel='height', ylabel='weight'>)
[Figure: the same scatter plot of weight versus height, color-coded by age]

It seems that the dependence on age can be better shown by the following correlation:

lgm.plots(
    'scatter', # making a scatter plot
    cols=('10*height - weight', 'age'), # 'age' versus '10*height - weight'
    eval=True, # we need to evaluate the above expression (since '10*height - weight' is not a column name)
    s=.5, # more scatter settings
    subsets='all', # plotting all rows
    )
(<Figure size 640x480 with 1 Axes>,
 <Axes: title={'center': 'All'}, xlabel='10*height - weight', ylabel='age'>)
[Figure: scatter plot of age versus 10*height - weight for all entries]

This appears to be a good correlation. So let us define a new quantity \(q\): \( q = 10H - W, \) where \(W\) denotes weight and \(H\) denotes height. We can calculate this and add it as a new column (see here for details):

lgm.eval('10*height - weight', to_col='q'); # evaluate '10*height - weight' and store it in the column 'q'
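To illustrate what evaluating a string expression over columns means, here is a minimal sketch that evaluates '10*height - weight' row by row on a dictionary of columns. It is not how PyTTOP's eval() works internally; `eval_expression` and the numbers are hypothetical, and Python's built-in eval should only be used on trusted expressions.

```python
def eval_expression(expr, columns):
    """Evaluate expr once per row, with column names bound to that row's values."""
    n_rows = len(next(iter(columns.values())))
    code = compile(expr, "<expr>", "eval")
    return [
        eval(code, {"__builtins__": {}}, {name: col[i] for name, col in columns.items()})
        for i in range(n_rows)
    ]


# Hypothetical values, chosen only for illustration.
columns = {
    "height": [5.0, 6.0],
    "weight": [45.0, 62.5],
}
columns["q"] = eval_expression("10*height - weight", columns)
print(columns["q"])
```

Each entry of the new column is q = 10H - W for the corresponding row.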

Great, let us report this in our new paper, On the Properties of the Little Green Men. However, we might want to set the labels more formally, rather than directly showing column names (see here for details):

lgm.set_labels(
    height=r'$H/\mathrm{cm}$',
    weight=r'$W/\mathrm{g}$',
    age=r'$\tau/\mathrm{yr}$',
    )

Now, with more control, we can further customize the plots (see here for details):

import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(10.8, 4.8))

lgm.plots(
    'scatter', # making a scatter plot
    cols=('height', 'weight'), # column 'weight' versus 'height'
    kwcols={'c': 'age'}, # color-coding: column 'age'
    s=.5, cmap='jet', # more scatter settings
    subsets='all', # plotting all entries
    ax=axes[0], # plot it in the first axis
    )
lgm.plots(
    'scatter', # making a scatter plot
    cols=('10*height - weight', 'age'), # 'age' versus '10*height - weight'
    eval=True, # we need to evaluate the above expression ('10*height - weight' is not a column name)
    s=.5, c='k', # more scatter settings
    subsets='all', # plotting all entries
    ax=axes[1], # plot it in the second axis
    )

# manual adjustments
axes[0].set_title('W-H correlation')
axes[1].set_title('My correlation!')
axes[1].set_xlabel('$10H-W$')
plt.tight_layout()
[Figure: two panels, 'W-H correlation' (weight versus height, color-coded by age) and 'My correlation!' (age versus 10H-W)]

As another example, let us study the house information of the LGMs.

lgm.plots(
    'scatter',
    cols=('ra', 'dec'),
    s=1, c='k',
    subsets='all',
    )
(<Figure size 640x480 with 1 Axes>,
 <Axes: title={'center': 'All'}, xlabel='ra', ylabel='dec'>)
[Figure: scatter plot of dec versus ra for all entries]

Suppose we are interested in the locations where LGMs with the surname “Smith” live:

lgm.plots(
    'scatter',
    cols=('ra', 'dec'),
    subsets=(~smith, smith), # `smith` is a subset defined earlier; `~smith` is its complement (rows not in `smith`)
    iter_kwargs={ # for the two subsets, `~smith` and `smith`, use two different settings
        's': [.5, 5],
        'c': ['gray', 'r'],
        'marker': ['.', '*'],
        },
    )
(<Figure size 640x480 with 1 Axes>, <Axes: xlabel='ra', ylabel='dec'>)
[Figure: scatter plot of dec versus ra, with the `~smith` entries in gray and the `smith` entries as red stars]

There appears to be a group of LGMs with the surname “Smith” living around ra = 80, dec = 60.
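The iter_kwargs argument used above pairs each listed setting with one subset. Assuming the i-th value of every list goes to the i-th subset (an assumption based on the comment in the code, not on PyTTOP's source), the pairing can be sketched as follows; `split_iter_kwargs` is a hypothetical helper.

```python
def split_iter_kwargs(iter_kwargs, n_subsets):
    """Turn {'key': [v0, v1, ...]} into one keyword dictionary per subset."""
    return [
        {key: values[i] for key, values in iter_kwargs.items()}
        for i in range(n_subsets)
    ]


iter_kwargs = {
    "s": [0.5, 5],
    "c": ["gray", "r"],
    "marker": [".", "*"],
}
per_subset = split_iter_kwargs(iter_kwargs, n_subsets=2)
for kwargs in per_subset:
    print(kwargs)
```

Under this reading, the first dictionary would style `~smith` and the second would style `smith`.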

There are many more features of the plots() method. Check the documentation for details.

Saving and exporting data#

Now, we would like to save our merged dataset for later use. With the following code, we can save it to the lgm.data file, which can be loaded later by PyTTOP (see here for details).

# lgm.save('path/to/lgm.data')

To publish or share our dataset, we may also need to export it using a certain format (see here for more information):

# lgm.save('path/to/lgm.txt', format='ascii')