Tutorial on matching#

This notebook demonstrates how to match catalogs with this package.

Data I/O#

In this package, data are handled by the pyttop.table.Data class. Intuitively (though not precisely), a ‘class’ is a type of object that can store information and perform certain operations. As we will see later, we can initialize a pyttop.table.Data object, which stores a data table in memory and lets you perform operations such as matching and merging.

To get started, import the pyttop.table.Data class:

from pyttop.table import Data

Reading and writing Table-like data#

The data table of a pyttop.table.Data object is stored in an astropy.table.Table object, so you can load anything that can be converted to an astropy.table.Table. For an introduction to tables and the documentation of astropy.table.Table, see Astropy’s documentation on Data Tables, especially the supported formats of Astropy’s Built-In Table Readers/Writers.

The most straightforward way to load a table is to initialize a Data object with the path to the data file (note that the files in ./samples/ directory are randomly generated datasets):

cat1 = Data('samples/catalog1.csv', name='cat1') # cat1 is a pyttop.table.Data object; its table is cat1.t

It is highly recommended to pass a name keyword argument, as this name is used to distinguish different datasets. If Data is initialized with a file path and no name is given, the name is automatically set to the file name.

The astropy.table.Table object that stores the table can be accessed with data.t (here, cat1.t), so you can do anything with it that you can do with an astropy.table.Table object.

print("Name:", cat1.name)
cat1.t # an astropy.table.Table
Name: cat1
Table length=100
survey1_id  RA       Dec      A      B
int64       float64  float64  int64  int64
0134.8344427850505-87.171373288193921648312
1342.2571503075698-32.72306298625976553935
2263.51781905210584-61.70796170313069637172
3215.5170543109332-44.22863779517675419919
456.16671055927715-8.3190173466516348445320
556.158027321032954-67.563699376601252557263
620.910100380551807-53.065536926793325592493
7311.82341247897665-22.00039753112561498399
8216.40140422755516-69.408165105753982200141
...............
9043.053928537788615-81.62075089746907559647
91256.7681234002782-9.250581784200591580188
92273.88261750208306-8.9623748553002542806487
93202.05979112501865-33.020868845405886537236
94277.54818478364194-59.487318805616945986372
95177.76641469118067-58.5711382848605241841384
96188.18381857751785-24.663988901678458716271
97153.91476660907787-9.2600766042680656971188
989.150885627874267-10.1622218161394365625486
9938.8409137175896-19.8112008728138562950191

You may also add keyword arguments to be passed to astropy.table.Table.read():

cat3id = Data('samples/catalog3_id.txt', name='cat3id',
              names=['cat3ID', 'survey2_id'], format='ascii') # 'ascii' is one of the supported formats of Astropy's Built-In Table Readers/Writers.
cat3v = Data('samples/catalog3_measurement.txt', name='cat3v',
             names=['survey2_id', 'x', 'y', 'class1'], format='ascii')

A Data object can also be created with an astropy.table.Table object, or anything that can be converted to an astropy.table.Table object.

from astropy.table import Table
cat2_table = Table.read('samples/catalog2.fits')
cat2 = Data(cat2_table, name='cat2')
cat4_dict = dict(Table.read('samples/catalog4.hdf5'))
# print(cat4_dict)
cat4 = Data(cat4_dict, name='cat4')

You can save the table to files with Astropy’s table writers (see Astropy’s documentation on Reading and Writing Table Objects):

# cat1.t.write(filename, format=supported_format)

Reading and writing a pyttop.table.Data object#

Though you can write the table with Astropy’s writers (as shown above), it is better to use the save() method of the pyttop.table.Data object itself (note that it is different from cat1.t.write()). This method saves not only the data table but also other key properties of the pyttop.table.Data object (e.g. the user-defined row subsets).

cat1.save('samples/output/cat1', overwrite=True) # set overwrite=True to overwrite an existing file

The data is saved to 'samples/output/cat1.data'.

Note that matches between the data and other datasets are not saved. If you want to keep the match between the data (say cat1) and another dataset (say cat4), you have to merge the datasets into one dataset (say merged_cat) and save the merged dataset (merged_cat). If you save cat1 to a “.data” file and load it later, you cannot recover the match between cat1 and cat4.

To load a “.data” file and (mostly) recover cat1 as a pyttop.table.Data object:

cat1 = Data.load('samples/output/cat1.data')

Matching#

In this package, catalog B is said to be matched to A, if each record (row) in A is assigned two values:

  • Whether it can be matched to a record in catalog B;

  • The index of the best-matching record in catalog B (if no match is possible, the index can be any number and carries no meaning).

A is referred to as the base data of the match.
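The two values can be pictured as a pair of parallel lists with one entry per row of A. The following is a minimal pure-Python sketch of exact matching on a key column. The function name exact_match and the placeholder index -1 are illustrative assumptions, not the package's internals.

```python
def exact_match(base_keys, other_keys):
    """Match rows of `other` to rows of `base` by equal key values.

    Returns (idx, matched), each with one entry per row of `base`.
    """
    # Remember the first row of `other` in which each key appears.
    first_row = {}
    for row, key in enumerate(other_keys):
        first_row.setdefault(key, row)

    idx, matched = [], []
    for key in base_keys:
        found = key in first_row
        matched.append(found)
        # -1 is an arbitrary placeholder: it means nothing when matched is False.
        idx.append(first_row[key] if found else -1)
    return idx, matched


base_keys = [10, 11, 12, 13]   # key column of A (the base data)
other_keys = [12, 10, 99, 10]  # key column of B
idx, matched = exact_match(base_keys, other_keys)
print(idx)      # [1, -1, 0, -1]
print(matched)  # [True, False, True, False]
```

Each row of A receives at most one row of B, which is why a key that occurs twice in B (10 here) contributes only one of its rows.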

Matching with a built-in matcher#

To match cat4 to cat1 with the exact value of the 'survey1_id' field in cat1 and the 'survey1_id' field in cat4, use an ExactMatcher:

from pyttop.matcher import ExactMatcher
cat1.match(cat4, ExactMatcher('survey1_id', 'survey1_id'))
[match] "cat4" matched to "cat1": 56/100 matched.
<Data 'cat1'>

Since cat4 contains more than one record with the same 'survey1_id', matching cat1 to cat4 is not equivalent to matching cat4 to cat1:

cat4.match(cat1, ExactMatcher('survey1_id', 'survey1_id'))
[match] "cat1" matched to "cat4": 70/70 matched.
<Data 'cat4'>

Instead of a column name, you may pass any iterable object (e.g. an array) as the values to match on, provided that it has the same length (i.e. number of records) as the corresponding catalog.

print('len(cat3v) =', len(cat3v))
print('len(cat3id) =', len(cat3id))
cat2.match(cat3v, ExactMatcher('survey2_id', cat3id.t['survey2_id']))
len(cat3v) = 110
len(cat3id) = 110
[match] "cat3v" matched to "cat2": 110/150 matched.
<Data 'cat2'>

You can also match data by their coordinates:

from pyttop.matcher import SkyMatcher
import astropy.units as u
cat1.match(cat2, SkyMatcher(unit=u.deg, unit1=(u.h, u.deg))) # cat1: RA and Dec in degrees; cat2: RA in hours, Dec in degrees.
[SkyMatcher] Data cat1: found RA name 'RA' and Dec name 'Dec'.
[SkyMatcher] Data cat2: found RA name 'RA' and Dec name 'Dec'.
[match] "cat2" matched to "cat1": 90/100 matched.
<Data 'cat1'>

For more information on SkyMatcher, use help(SkyMatcher).

To remove all matches to cat1, use:

# cat1.reset_match()

To unmatch cat2 from cat1:

cat1.unmatch(cat2)
[match] "cat2" unmatched to "cat1".

For SkyMatcher, you can also explore the distribution of minimum sky distances, i.e. the distance to the nearest object in cat2 for each object in cat1:

skymatcher = SkyMatcher(unit=u.deg, unit1=(u.h, u.deg))
skymatcher.explore(cat1, cat2)
cat1.match(cat2, skymatcher)
[SkyMatcher] Data cat1: found RA name 'RA' and Dec name 'Dec'.
[SkyMatcher] Data cat2: found RA name 'RA' and Dec name 'Dec'.
[match] "cat2" matched to "cat1": 90/100 matched.
<Data 'cat1'>
[Figure: distribution of minimum sky distances from objects in cat1 to their nearest objects in cat2]
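To make "minimum sky distance" concrete, here is a brute-force, pure-Python sketch that computes, for each base position, the nearest position in another catalog and compares the separation with a threshold. The helper names, the sample coordinates and the 1-arcsecond threshold are assumptions made for this illustration; it does not reproduce how SkyMatcher is implemented. In practice, Astropy's SkyCoord.match_to_catalog_sky performs this kind of nearest-neighbour search efficiently.

```python
import math


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def nearest_neighbours(base, other):
    """For each (ra, dec) in `base`, return (index, separation) of the nearest point in `other`."""
    result = []
    for ra, dec in base:
        seps = [angular_sep_deg(ra, dec, ra2, dec2) for ra2, dec2 in other]
        best = min(range(len(other)), key=seps.__getitem__)
        result.append((best, seps[best]))
    return result


# Base coordinates in degrees; the other catalog stores RA in hours (1 h = 15 deg).
base_deg = [(150.0, 2.0), (30.0, -45.01)]
other_hours = [(2.0, -45.0), (10.0, 2.0001), (5.0, 60.0)]
other_deg = [(ra_h * 15.0, dec) for ra_h, dec in other_hours]

thres_arcsec = 1.0  # assumed threshold for this illustration
res = nearest_neighbours(base_deg, other_deg)
for i, (j, sep) in enumerate(res):
    sep_arcsec = sep * 3600.0
    print(i, j, round(sep_arcsec, 2), sep_arcsec <= thres_arcsec)
```

The first base object has a neighbour about 0.36 arcsec away and passes the threshold; the second is about 36 arcsec from its nearest neighbour and does not. A histogram of these minimum separations is what helps in choosing a sensible threshold.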

Defining custom matchers#

Note: This part is for advanced users. If you are new to this package, you may skip this part for now.

You may also define your own matchers. A matcher class should be defined like this:

class MyMatcher():
    def __init__(self, args): # 'args' means any number of arguments that you need
        # initialize it with args you need
        pass
    
    def get_values(self, data, data1, verbose=True): # data1 is matched to data
        # prepare the data that is needed to do the matching (if necessary)
        pass
    
    def match(self):
        # do the matching process and calculate:
        # idx : array of shape (len(data), ). 
        #     the index of a record in data1 that best matches the records in data
        # matched : boolean array of shape (len(data), ).
        #     whether the records in data can be matched to those in data1.
        return idx, matched
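
As a concrete instance of this skeleton, the following sketch matches rows whose numeric values agree within a tolerance. ToleranceMatcher and its arguments are hypothetical names, plain dicts stand in for the two datasets, and the sketch assumes that get_values() is called before match(). It returns Python lists for simplicity; since the skeleton asks for arrays, real use would likely convert them (e.g. with numpy.asarray).

```python
class ToleranceMatcher:
    """Match rows whose numeric values differ by at most `tol` (hypothetical example)."""

    def __init__(self, col, col1, tol):
        self.col = col    # column name in the base data
        self.col1 = col1  # column name in the data matched to the base
        self.tol = tol    # maximum allowed absolute difference

    def get_values(self, data, data1, verbose=True):
        # Cache the two columns; data[...] is assumed to return a sequence of numbers.
        self.values = list(data[self.col])
        self.values1 = list(data1[self.col1])
        if verbose:
            print(f"[ToleranceMatcher] comparing '{self.col}' with '{self.col1}'")

    def match(self):
        idx, matched = [], []
        for v in self.values:
            if not self.values1:
                # Nothing to match against: the index is a meaningless placeholder.
                idx.append(0)
                matched.append(False)
                continue
            # Index of the closest value in data1.
            best = min(range(len(self.values1)),
                       key=lambda j: abs(self.values1[j] - v))
            idx.append(best)
            matched.append(abs(self.values1[best] - v) <= self.tol)
        return idx, matched


# Plain dicts stand in for the two datasets in this sketch.
base = {'mag': [18.0, 20.5, 23.0]}
other = {'mag_obs': [20.4, 17.0, 18.05]}
m = ToleranceMatcher('mag', 'mag_obs', tol=0.2)
m.get_values(base, other, verbose=False)
idx, matched = m.match()
print(idx)      # [2, 0, 0]
print(matched)  # [True, True, False]
```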

Merging catalogs#

Match tree#

If B is matched to A, I call B the child data of A, and A the parent data of B.

Say B, C are matched to A, and D is matched to B. Then B and C are children of A, and D is a child of B. When we try to merge everything into A (i.e. merge the information in A’s children, grandchildren, etc. into A), it may be useful to see all of its children and grandchildren, or what I call the match tree:

cat1.match_tree(detail=False)
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat1
:   cat4
:   :   (cat1)
:   cat2
:   :   cat3v
---------------

From the match tree we can see that cat4 and cat2 are matched to cat1, and cat3v is matched to cat2. Although cat1 is also matched to cat4, this match is a duplicate in this match tree and will be ignored when merging everything (cat4, cat2 and cat3v) into cat1.

For more information on how they are matched:

cat1.match_tree(detail=True)
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat1 [base]
:   cat4 [ExactMatcher("survey1_id", "survey1_id")]
:   :   (cat1) [ExactMatcher("survey1_id", "survey1_id")]
:   cat2 [<SkyMatcher with thres=1>]
:   :   cat3v [ExactMatcher("survey2_id", "survey2_id")]
---------------

For example, we may also use cat4 as the base catalog:

cat4.match_tree(detail=False)
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat4
:   cat1
:   :   (cat4)
:   :   cat2
:   :   :   cat3v
---------------

Catalog merging#

Now we can merge everything that can be merged into cat1:

merged_cat = cat1.merge(outname='my_merged_catalog')
print("Name:", merged_cat.name)
merged_cat.t
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat1
:   cat4
:   :   (cat1)
:   cat2
:   :   cat3v
---------------
[merge] merged: cat1, cat4, cat2, cat3v
Name: my_merged_catalog
Table length=37
survey1_id_cat1  RA_cat1  Dec_cat1  A      B      id     survey1_id_cat4  i      j      survey2_id_cat2  RA_cat2  Dec_cat2  a      b      survey2_id_cat3v  x      y      class1
int64            float64  float64   int64  int64  int32  int32            int32  int32  int32            float64  float64   int32  int32  int64             int64  int64  str8
0134.8344427850505-87.17137328819392164831261046123147948.98895958611857-87.1713732881939231915094671583Type II
1342.2571503075698-32.7230629862597655393534127404889022.817134875833332-32.7230629862597653833890288458Type II
2263.51781905210584-61.70796170313069637172823568694410317.56785302820989-61.707961703130694316103523685Type II
7311.82341247897665-22.000397531125614983996475238219010120.78823016812825-22.000397531125614500548101223116Type III
9254.90612800657638-83.071808115408632961459529891545011316.993742542070294-83.0718081154086313087113515659Type II
11349.16754677831796-75.49008414713964433470711611660835323.277828530405216-75.490084147139634952853674272Type III
16109.52720746543358-17.6695130790796926944287361612391347837.3018168152337175-17.66951307907969224901683569462Type III
19104.8424904712951-41.45919822759143649113416319119848061046.989499173053064-41.459198227591436442573104556785Type III
20220.2670421000566-17.330386035234373987501120220354067814.68447580405005-17.3303860352343730792578473165Type I
......................................................
69355.2792971761862-36.819635113058248319913569228971144623.685285687773018-36.819635113058242079146257590Type II
7726.65607462427253-55.193818832951635386327823772815741101.7770800016118025-55.193818832951635345279110221394Type II
78129.04766227593814-5.694301013693888745512607870296887428.603179857588493-5.69430101369388824829142195720Type III
7941.7128614290467-77.623115026860640143925579955682102.780858910895596-77.623115026860647465039488Type II
8322.88100610296851-11.039458195711703154121338384673230961.5254003596640098-11.0394581957117033983979657083Type III
86262.6582242017031-16.4500019818905880962891986395332914317.510545262394547-16.4500019818905895161433921Type II
88319.39658732747756-42.3314479479594273431253288458567327521.293101039889653-42.3314479479594218236175680773Type II
93202.05979112501865-33.02086884540588653723618931939367313613.470648167799785-33.020868845405886539925136241788Type I
989.150885627874267-10.1622218161394365625486149819253124810.6100575664012837-10.16222181613943628067281413520Type III
9938.8409137175896-19.811200872813856295019117999562408692.589396089660706-19.81120087281385648733269277716Type II

Note that columns sharing the same name are renamed by appending the name of their Data object (e.g. RA_cat1 and RA_cat2). You may also check that the matches are indeed correct.

We may now save the merged catalog for later use:

merged_cat.save('samples/output/merged_cat', overwrite=True) # you don't need overwrite=True if file 'samples/output/merged_cat.data' does not exist
# load with:
# merged_cat = Data.load('samples/output/merged_cat.data')

Source of a column#

Sometimes we forget the name of the data from which a column is merged. For example, we are sure that the column named 'A' is merged from somewhere, but cannot recall the name of that dataset. We can use the from_which() method:

merged_cat.from_which('A')
'cat1 (samples/catalog1.csv)'

The column 'A' indeed comes from cat1, which is loaded from "samples/catalog1.csv".

WARNING. The from_which() method has several limitations:

  • In some cases, the software cannot decide the source of the column, and from_which() will return an empty string.

  • Direct operations on the data table (astropy.table.Table), especially adding columns to the data table using the values of other columns, can make from_which() return incorrect results. For example, instead of:

# WRONG:
merged_cat.t['A+i (wrong)'] = merged_cat.t['A'] + merged_cat.t['i'] # adding a new column
merged_cat.from_which('A+i (wrong)') # wrong result
'cat1 (samples/catalog1.csv)'

use:

# CORRECT:
merged_cat['A+i (right)'] = merged_cat['A'] + merged_cat['i'] # adding a new column
merged_cat.from_which('A+i (right)') # right result: this column is added by the user
'user-added (set by user)'

Merging options#

Maybe you want to keep records that cannot be matched to cat3v and only want to merge subsets of columns from the catalogs:

merge_columns = { # specify columns to be merged
    'cat1': ['survey1_id', 'RA', 'Dec'],
    'cat4': ['i', 'j'],
    'cat2': ['survey2_id'],
    'cat3v': ['class1'],
    }

keep_unmatched = ['cat3v'] # keep records that cannot be matched to cat3v

another_merged_cat = cat1.merge(keep_unmatched=keep_unmatched, 
                                merge_columns=merge_columns) # use default outname
print("Name:", another_merged_cat.name)
another_merged_cat.t
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat1
:   cat4
:   :   (cat1)
:   cat2
:   :   cat3v
---------------
[merge] entries with no match for cat3v is kept.
[merge] merged: cat1, cat4, cat2, cat3v
Name: (cat1).MATCH(cat4, cat2, cat3v)
Table length=51
survey1_id  RA       Dec      i      j      survey2_id  class1
int64       float64  float64  int32  int32  int32       str8
0134.8344427850505-87.171373288193924612314794Type II
1342.2571503075698-32.72306298625976274048890Type II
2263.51781905210584-61.707961703130635686944103Type II
456.16671055927715-8.31901734665163453421667122--
556.158027321032954-67.56369937660125319298692--
7311.82341247897665-22.00039753112561452382190101Type III
9254.90612800657638-83.071808115408638915450113Type II
107.410417946488881-63.9223692377608761969138814--
11349.16754677831796-75.49008414713966116608353Type III
.....................
7726.65607462427253-55.193818832951635281574110Type II
78129.04766227593814-5.6943010136938887029688742Type III
7941.7128614290467-77.623115026860695568210Type II
8322.88100610296851-11.0394581957117038467323096Type III
86262.6582242017031-16.450001981890583953329143Type II
88319.39658732747756-42.331447947959424585673275Type II
93202.05979112501865-33.02086884540588619393673136Type I
96188.18381857751785-24.663988901678451742577317--
989.150885627874267-10.1622218161394361925312481Type III
9938.8409137175896-19.811200872813856956240869Type II

You can also specify the columns to be ignored during merging (the ignore_columns argument of merge()); see help(Data.merge) for details.
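
The effect of keeping unmatched records can be sketched in plain Python: a merge follows the (index, matched) pair of a match, and an unmatched base row is either dropped or kept with missing values. The function merge_rows and its row-dict representation are illustrative assumptions; the tables above display missing values as "--", whereas this sketch uses None.

```python
def merge_rows(base_rows, other_rows, idx, matched, keep_unmatched=False):
    """Join `other_rows` onto `base_rows` using the (idx, matched) pair of a match.

    Rows are dicts with distinct column names. Unmatched base rows are dropped
    unless keep_unmatched is True, in which case missing fields are set to None.
    """
    other_cols = list(other_rows[0]) if other_rows else []
    merged = []
    for row, j, ok in zip(base_rows, idx, matched):
        if ok:
            merged.append({**row, **other_rows[j]})
        elif keep_unmatched:
            merged.append({**row, **dict.fromkeys(other_cols)})
    return merged


base_rows = [
    {'survey_id': 0, 'value': 'A'},
    {'survey_id': 1, 'value': 'B'},
    {'survey_id': 2, 'value': 'C'},
]
other_rows = [{'class1': 'Type II'}, {'class1': 'Type I'}]
idx = [1, 0, 0]                # best-match row in other_rows for each base row
matched = [True, False, True]  # the second base row has no match

inner = merge_rows(base_rows, other_rows, idx, matched)
outer = merge_rows(base_rows, other_rows, idx, matched, keep_unmatched=True)
print(len(inner), len(outer))  # 2 3
print(outer[1])                # {'survey_id': 1, 'value': 'B', 'class1': None}
```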

You may also set the depth for match_tree and merge methods. Setting depth to 0 means only keeping the base catalog itself.

cat4.match_tree(depth=2)
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat4 [base]
:   cat1 [ExactMatcher("survey1_id", "survey1_id")]
:   :   (cat4) [ExactMatcher("survey1_id", "survey1_id")]
:   :   cat2 [<SkyMatcher with thres=1>]
---------------
cat4.match_tree(depth=0)
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat4 [base]
---------------
cat4.merge(depth=1).t
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat4
:   cat1
---------------
[merge] merged: cat4, cat1
Table length=70
id     survey1_id_cat4  i      j      survey1_id_cat1  RA       Dec      A      B
int32  int32            int32  int32  int64            float64  float64  int64  int64
0787029688778129.04766227593814-5.6943010136938887455126
1336099395233341.59879341119995-59.6146345736734843436325
2359499468335291.023045321926-60.911736118132026589553
353192986556.158027321032954-67.563699376601252557263
4383178186338246.32388954437647-57.2733357858635444893100
5416052444041178.26368764005727-67.33959337571723560052
657273449095770.55383047089227-21.45423462041542606307
7116116608311349.16754677831796-75.49008414713964433470
82356869442263.51781905210584-61.70796170313069637172
...........................
60937519723293202.05979112501865-33.020868845405886537236
610461231470134.8344427850505-87.171373288193921648312
62497104089456.16671055927715-8.3190173466516348445320
63191198480619104.8424904712951-41.4591982275914364911341
647523821907311.82341247897665-22.00039753112561498399
654260899164212.379867601478622-45.24763446968531799659
66756257398675262.4425804947554-74.307021389550771895116
670770528750134.8344427850505-87.171373288193921648312
6856371967485631.85730073869102-29.50780073347093301423
6963302559263128.43119760969213-41.78027843327174980157

Things to be noted#

This package does not support matching multiple records to a single record in the base data. For example, table2 below has two records with the same survey_id:

table1 = Data({'survey_id': [0, 1, 2], 'value': ['A', 'B', 'C']}, name='t1')
table1.t
Table length=3
survey_id  value
int64      str1
0          A
1          B
2          C
table2 = Data({'table2_id': [0, 1, 2], 'survey_id': [0, 1, 0]}, name='t2')
table2.t
Table length=3
table2_id  survey_id
int64      int64
0          0
1          1
2          0

If you match table2 to table1 by survey_id using ExactMatcher, the first exact match in table2 will be used:

table1.match(table2, ExactMatcher('survey_id', 'survey_id')).merge().t
[match] "t2" matched to "t1": 2/3 matched.
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
t1
:   t2
---------------
[merge] merged: t1, t2
Table length=2
survey_id_t1  value  table2_id  survey_id_t2
int64         str1   int64      int64
0             A      0          0
1             B      1          1

If you wish to keep all the records with the same survey_id in table2, you may match table1 to table2 instead of matching table2 to table1:

table2.match(table1, ExactMatcher('survey_id', 'survey_id')).merge().t
[match] "t1" matched to "t2": 3/3 matched.
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
t2
:   t1
:   :   (t2)
---------------
[merge] merged: t2, t1
Table length=3
table2_id  survey_id_t2  survey_id_t1  value
int64      int64         int64         str1
0          0             0             A
1          1             1             B
2          0             0             A

Or you may merge these records (those with the same survey_id) into a single record before matching and merging the catalogs.
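
One way to combine such records beforehand is to group rows by survey_id and reduce each group to a single row. The sketch below is plain Python; the helper collapse_duplicates, the 'flux' column and the averaging rule are hypothetical and only illustrate the idea.

```python
def collapse_duplicates(rows, key, combine):
    """Group rows sharing the same `key` value and reduce each group to one row."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    # Dicts preserve insertion order, so groups appear in order of first occurrence.
    return [combine(group) for group in groups.values()]


def mean_flux(group):
    # Hypothetical rule: keep the key, average 'flux', and count the records.
    return {
        'survey_id': group[0]['survey_id'],
        'flux': sum(r['flux'] for r in group) / len(group),
        'n_records': len(group),
    }


rows = [
    {'survey_id': 0, 'flux': 1.0},
    {'survey_id': 1, 'flux': 5.0},
    {'survey_id': 0, 'flux': 3.0},
]
unique_rows = collapse_duplicates(rows, 'survey_id', mean_flux)
print(unique_rows)
# [{'survey_id': 0, 'flux': 2.0, 'n_records': 2}, {'survey_id': 1, 'flux': 5.0, 'n_records': 1}]
```

After this step every survey_id occurs once, so matching the collapsed table to a base catalog no longer silently discards records.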