Tutorial on matching#

This notebook demonstrates how to match catalogs with this package.

Data I/O#

In this package, data are handled by the pyttop.table.Data class. Intuitively (though not precisely), a ‘class’ is a type of object that can store information and perform certain operations. As we will see later, we can initialize a pyttop.table.Data object, which stores a data table in memory and supports operations such as matching and merging.

To get started, import the pyttop.table.Data class:

from pyttop.table import Data

Reading and writing `Table`-like data#

The data table of a pyttop.table.Data object is stored in an astropy.table.Table object, so you can load anything that can be converted to an astropy.table.Table. For an introduction to tables and the documentation of astropy.table.Table, see Astropy’s documentation on Data Tables, especially the formats supported by Astropy’s Built-In Table Readers/Writers.

The most straightforward way to load a table is to initialize a Data object with the path to the data file (note that the files in ./samples/ directory are randomly generated datasets):

cat1 = Data('samples/catalog1.csv', name='cat1') # cat1 is a pyttop.table.Data object; the table itself is cat1.t

It is highly recommended to pass a name keyword argument, as this name is used to distinguish different datasets. If Data is initialized with a path to a file and no name is given, the name is set to the file name automatically.

The astropy.table.Table object that stores the table can be accessed through the t attribute (e.g. cat1.t), so anything you can do with an astropy.table.Table object is available.

print("Name:", cat1.name)
cat1.t # an astropy.table.Table

Name: cat1

Table length=100

survey1_id	RA	Dec	A	B
int64	float64	float64	int64	int64
0	134.8344427850505	-87.17137328819392	1648	312
1	342.2571503075698	-32.72306298625976	5539	35
2	263.51781905210584	-61.7079617031306	9637	172
3	215.5170543109332	-44.22863779517675	4199	19
4	56.16671055927715	-8.319017346651634	8445	320
5	56.158027321032954	-67.56369937660125	2557	263
6	20.910100380551807	-53.06553692679332	5592	493
7	311.82341247897665	-22.000397531125614	98	399
8	216.40140422755516	-69.40816510575398	2200	141
...	...	...	...	...
90	43.053928537788615	-81.62075089746907	5596	47
91	256.7681234002782	-9.250581784200591	5801	88
92	273.88261750208306	-8.962374855300254	2806	487
93	202.05979112501865	-33.020868845405886	537	236
94	277.54818478364194	-59.48731880561694	5986	372
95	177.76641469118067	-58.571138284860524	1841	384
96	188.18381857751785	-24.66398890167845	8716	271
97	153.91476660907787	-9.260076604268065	6971	188
98	9.150885627874267	-10.162221816139436	5625	486
99	38.8409137175896	-19.811200872813856	2950	191

You may also add keyword arguments to be passed to astropy.table.Table.read():

cat3id = Data('samples/catalog3_id.txt', name='cat3id',
              names=['cat3ID', 'survey2_id'], format='ascii') # 'ascii' is one of the supported formats of Astropy's Built-In Table Readers/Writers.
cat3v = Data('samples/catalog3_measurement.txt', name='cat3v',
             names=['survey2_id', 'x', 'y', 'class1'], format='ascii')

A Data object can also be created with an astropy.table.Table object, or anything that can be converted to an astropy.table.Table object.

from astropy.table import Table
cat2_table = Table.read('samples/catalog2.fits')
cat2 = Data(cat2_table, name='cat2')
cat4_dict = dict(Table.read('samples/catalog4.hdf5'))
# print(cat4_dict)
cat4 = Data(cat4_dict, name='cat4')

You can save the table to files with Astropy’s table writers (see Astropy’s documentation on Reading and Writing Table Objects):

# cat1.t.write(filename, format=supported_format)

Reading and writing a `pyttop.table.Data` object#

Although you can write the table with Astropy’s writers (as shown above), it is recommended to use the save() method of the pyttop.table.Data object itself (note that it differs from cat1.t.write()). This method saves not only the data table but also other key properties of the pyttop.table.Data object (e.g. the user-defined row subsets).

cat1.save('samples/output/cat1', overwrite=True) # set overwrite=True to overwrite an existing file

The data is saved to 'samples/output/cat1.data'.

Note that matches with other datasets are not saved. To keep the match between a dataset (say cat1) and another (say cat4), you have to merge them into one dataset (say merged_cat) and save the merged dataset. If you save cat1 to a “.data” file and load it later, the match between cat1 and cat4 cannot be recovered.

To load a “.data” file and (mostly) recover the pyttop.table.Data object cat1:

cat1 = Data.load('samples/output/cat1.data')

Matching#

In this package, catalog B is said to be matched to catalog A if each record (row) in A is assigned two values:

Whether it can be matched to a record in catalog B;
The index of the best-matching record in catalog B (if no match is possible, the index can be any number and carries no meaning).

A is referred to as the base data of the match.
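
The two values described above can be illustrated with a small sketch in plain Python. This is not the package's implementation; the function name exact_match and the choice of 0 as the placeholder index are assumptions made for illustration only.

```python
def exact_match(base_keys, other_keys):
    """For each key in base_keys, find the first equal key in other_keys."""
    # Map each key of the other catalog to the index of its first occurrence.
    first_index = {}
    for i, key in enumerate(other_keys):
        first_index.setdefault(key, i)

    idx = []      # index of the matched record in the other catalog
    matched = []  # whether a match exists for this base record
    for key in base_keys:
        found = key in first_index
        matched.append(found)
        # Without a match the index is a meaningless placeholder (0 here).
        idx.append(first_index[key] if found else 0)
    return idx, matched

# A is the base data; B is matched to A.
a_ids = [10, 11, 12, 13]
b_ids = [12, 10, 99, 10]
idx, matched = exact_match(a_ids, b_ids)
print(idx)      # [1, 0, 0, 0]
print(matched)  # [True, False, True, False]
```

Both lists have the length of the base data A, never of B, which is why swapping the roles of A and B generally gives a different result.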

Matching with a built-in matcher#

To match cat4 to cat1 with the exact value of the 'survey1_id' field in cat1 and the 'survey1_id' field in cat4, use an ExactMatcher:

from pyttop.matcher import ExactMatcher
cat1.match(cat4, ExactMatcher('survey1_id', 'survey1_id'))

[match] "cat4" matched to "cat1": 56/100 matched.

<Data 'cat1'>

Since cat4 contains more than one record for the same 'survey1_id', matching cat1 to cat4 is not equivalent to matching cat4 to cat1:

cat4.match(cat1, ExactMatcher('survey1_id', 'survey1_id'))

[match] "cat1" matched to "cat4": 70/70 matched.

<Data 'cat4'>

You may use any iterable object (e.g. an array) in place of a column name, provided that it has the same length (i.e. number of records) as the corresponding catalog.

print('len(cat3v) =', len(cat3v))
print('len(cat3id) =', len(cat3id))
cat2.match(cat3v, ExactMatcher('survey2_id', cat3id.t['survey2_id']))

len(cat3v) = 110
len(cat3id) = 110
[match] "cat3v" matched to "cat2": 110/150 matched.

<Data 'cat2'>

You can also match datasets by their coordinates:

from pyttop.matcher import SkyMatcher
import astropy.units as u
cat1.match(cat2, SkyMatcher(unit=u.deg, unit1=(u.h, u.deg))) # cat1: RA and Dec in degrees; cat2: RA in hours, Dec in degrees.

[SkyMatcher] Data cat1: found RA name 'RA' and Dec name 'Dec'.
[SkyMatcher] Data cat2: found RA name 'RA' and Dec name 'Dec'.

[match] "cat2" matched to "cat1": 90/100 matched.

<Data 'cat1'>

For more information on SkyMatcher, use help(SkyMatcher).
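
To make the idea behind coordinate matching concrete, here is a minimal sketch of nearest-neighbour matching on the sky, written with the standard library only. It is not how SkyMatcher is implemented; the function names and the 1 arcsec threshold are assumptions chosen for illustration. It also shows why the units matter: RA given in hours has to be multiplied by 15 before it can be compared with RA in degrees.

```python
import math

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def nearest_match(base, other, max_sep_deg):
    """base, other: lists of (ra_deg, dec_deg). Returns (idx, matched)."""
    idx, matched = [], []
    for ra, dec in base:
        seps = [angular_separation(ra, dec, ra2, dec2) for ra2, dec2 in other]
        best = min(range(len(other)), key=seps.__getitem__)  # nearest object
        idx.append(best)
        matched.append(seps[best] <= max_sep_deg)
    return idx, matched

# Base coordinates are in degrees; the other catalog gives RA in hours.
base = [(134.8344, -87.1714), (342.2571, -32.7231), (10.0, 20.0)]
other_hours = [(22.81714, -32.7231), (8.98896, -87.1714)]
other = [(ra_h * 15.0, dec) for ra_h, dec in other_hours]  # 1 hour = 15 degrees

# Hypothetical threshold of 1 arcsec, expressed in degrees.
idx, matched = nearest_match(base, other, max_sep_deg=1 / 3600)
print(matched)  # [True, True, False]
```

The brute-force loop compares every pair and is only meant to show the logic; for real catalogs a spatial index is the usual choice.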

To remove all matches to cat1, use:

# cat1.reset_match()

To unmatch cat2 from cat1:

cat1.unmatch(cat2)

[match] "cat2" unmatched to "cat1".

For SkyMatcher, you can also explore the distribution of minimum sky distances, i.e. the distance to the nearest object in cat2 for each object in cat1:

skymatcher = SkyMatcher(unit=u.deg, unit1=(u.h, u.deg))
skymatcher.explore(cat1, cat2)
cat1.match(cat2, skymatcher)

[SkyMatcher] Data cat1: found RA name 'RA' and Dec name 'Dec'.
[SkyMatcher] Data cat2: found RA name 'RA' and Dec name 'Dec'.

[match] "cat2" matched to "cat1": 90/100 matched.

<Data 'cat1'>

[Figure: distribution of the minimum sky distance from each object in cat1 to its nearest object in cat2]

Defining custom matchers#

Note: This part is for advanced users. If you are new to this package, you may skip this part for now.

You may also define your own matchers. A matcher class should be defined like this:

class MyMatcher():
    def __init__(self, args): # 'args' means any number of arguments that you need
        # initialize it with args you need
        pass
    
    def get_values(self, data, data1, verbose=True): # data1 is matched to data
        # prepare the data that is needed to do the matching (if necessary)
        pass
    
    def match(self):
        # do the matching process and calculate:
        # idx : array of shape (len(data), ). 
        #     the index of a record in data1 that best matches the records in data
        # matched : boolean array of shape (len(data), ).
        #     whether the records in data can be matched to those in data1.
        return idx, matched
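
As an illustration of the template above, the following sketch fills it in with a matcher that pairs records whose values in a numeric column differ by at most a tolerance. The class name NearestValueMatcher and its arguments are hypothetical. The sketch assumes only what the template shows (get_values() receives the two datasets, match() returns idx and matched) plus access to columns through the t attribute. Stand-in objects are used so that it runs without the package, and plain lists are returned where the template asks for arrays.

```python
from types import SimpleNamespace

class NearestValueMatcher:
    """Match records whose values in a numeric column differ by at most `tol`."""

    def __init__(self, col, col1, tol):
        self.col, self.col1, self.tol = col, col1, tol

    def get_values(self, data, data1, verbose=True):  # data1 is matched to data
        # Assumption: the table is reachable as `.t` and indexable by column name.
        self.values = list(data.t[self.col])
        self.values1 = list(data1.t[self.col1])
        if verbose:
            print(f"[NearestValueMatcher] {len(self.values1)} candidates "
                  f"for {len(self.values)} base records.")

    def match(self):
        idx, matched = [], []
        for v in self.values:
            diffs = [abs(v - v1) for v1 in self.values1]
            best = min(range(len(diffs)), key=diffs.__getitem__)
            idx.append(best)                          # closest record in data1
            matched.append(diffs[best] <= self.tol)   # close enough to count?
        return idx, matched

# Stand-in objects so the sketch runs without the package installed.
base = SimpleNamespace(t={'x': [1.0, 5.0, 9.0]})
other = SimpleNamespace(t={'x': [5.1, 0.9, 20.0]})

matcher = NearestValueMatcher('x', 'x', tol=0.2)
matcher.get_values(base, other, verbose=False)
idx, matched = matcher.match()
print(idx)      # [1, 0, 0]
print(matched)  # [True, True, False]
```

With the real package, converting idx and matched to arrays of shape (len(data), ) before returning them would follow the template more closely.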

Merging catalogs#

Match tree#

If B is matched to A, I call B the child data of A, and A the parent data of B.

Say B and C are matched to A, and D is matched to B. Then B and C are children of A, and D is a child of B. When we merge everything into A (i.e. merge the information in A’s children, grandchildren, etc. into A), it is useful to see all of its children and grandchildren, which I call the match tree:

cat1.match_tree(detail=False)

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat1
:   cat4
:   :   (cat1)
:   cat2
:   :   cat3v
---------------

From the match tree we may see that cat4 and cat2 are matched to cat1 and cat3v is matched to cat2. Although cat1 is also matched to cat4, this match is a duplication in this match tree, and will be ignored when merging everything (cat4, cat2 and cat3v) into cat1.
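
The way such a tree can be built is sketched below: a depth-first walk over the "is matched to" relation that marks names already present in the tree with parentheses instead of expanding them again. This is an illustrative reconstruction, not the package's code; the function match_tree_lines and the dictionary layout are assumptions, and the optional depth argument is included only to show how the walk could be limited.

```python
def match_tree_lines(matches, base, depth=None):
    """Return the lines of a text tree for `base`.

    `matches` maps a dataset name to the names of the datasets matched to it.
    """
    lines = [base]
    visited = {base}

    def expand(name, level):
        if depth is not None and level > depth:
            return
        for child in matches.get(name, []):
            prefix = ':   ' * level
            if child in visited:
                # Already in the tree: show in parentheses, do not expand again.
                lines.append(f'{prefix}({child})')
            else:
                visited.add(child)
                lines.append(f'{prefix}{child}')
                expand(child, level + 1)

    expand(base, 1)
    return lines

matches = {
    'cat1': ['cat4', 'cat2'],  # cat4 and cat2 are matched to cat1
    'cat4': ['cat1'],          # cat1 is matched to cat4
    'cat2': ['cat3v'],         # cat3v is matched to cat2
}
print('\n'.join(match_tree_lines(matches, 'cat1')))
```

Tracking visited names is what keeps the mutual match between cat1 and cat4 from producing an endless loop.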

For more information on how they are matched:

cat1.match_tree(detail=True)

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat1 [base]
:   cat4 [ExactMatcher("survey1_id", "survey1_id")]
:   :   (cat1) [ExactMatcher("survey1_id", "survey1_id")]
:   cat2 [<SkyMatcher with thres=1>]
:   :   cat3v [ExactMatcher("survey2_id", "survey2_id")]
---------------

For example, we may also use cat4 as the base catalog:

cat4.match_tree(detail=False)

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat4
:   cat1
:   :   (cat4)
:   :   cat2
:   :   :   cat3v
---------------

Catalog merging#

Now we can merge everything that can be merged into cat1:

merged_cat = cat1.merge(outname='my_merged_catalog')
print("Name:", merged_cat.name)
merged_cat.t

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat1
:   cat4
:   :   (cat1)
:   cat2
:   :   cat3v
---------------
[merge] merged: cat1, cat4, cat2, cat3v
Name: my_merged_catalog

Table length=37

survey1_id_cat1	RA_cat1	Dec_cat1	A	B	id	survey1_id_cat4	i	j	survey2_id_cat2	RA_cat2	Dec_cat2	a	b	survey2_id_cat3v	x	y	class1
int64	float64	float64	int64	int64	int32	int32	int32	int32	int32	float64	float64	int32	int32	int64	int64	int64	str8
0	134.8344427850505	-87.17137328819392	1648	312	61	0	4612	3147	94	8.98895958611857	-87.17137328819392	3191	50	94	671	583	Type II
1	342.2571503075698	-32.72306298625976	5539	35	34	1	2740	488	90	22.817134875833332	-32.72306298625976	5383	38	90	288	458	Type II
2	263.51781905210584	-61.7079617031306	9637	172	8	2	3568	6944	103	17.56785302820989	-61.7079617031306	943	16	103	523	685	Type II
7	311.82341247897665	-22.000397531125614	98	399	64	7	5238	2190	101	20.78823016812825	-22.000397531125614	5005	48	101	223	116	Type III
9	254.90612800657638	-83.07180811540863	2961	459	52	9	891	5450	113	16.993742542070294	-83.07180811540863	1308	7	113	515	659	Type II
11	349.16754677831796	-75.4900841471396	4433	470	7	11	6116	6083	53	23.277828530405216	-75.4900841471396	3495	28	53	674	272	Type III
16	109.52720746543358	-17.669513079079692	6944	287	36	16	1239	1347	83	7.3018168152337175	-17.669513079079692	2490	16	83	569	462	Type III
19	104.8424904712951	-41.459198227591436	4911	341	63	19	1198	4806	104	6.989499173053064	-41.459198227591436	4425	73	104	556	785	Type III
20	220.2670421000566	-17.33038603523437	3987	50	11	20	2203	5406	78	14.68447580405005	-17.33038603523437	3079	25	78	473	165	Type I
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
69	355.2792971761862	-36.81963511305824	8319	91	35	69	2289	7114	46	23.685285687773018	-36.81963511305824	2079	1	46	257	590	Type II
77	26.65607462427253	-55.193818832951635	3863	278	23	77	281	574	110	1.7770800016118025	-55.193818832951635	3452	79	110	221	394	Type II
78	129.04766227593814	-5.694301013693888	7455	126	0	78	7029	6887	42	8.603179857588493	-5.694301013693888	2482	91	42	195	720	Type III
79	41.7128614290467	-77.6231150268606	4014	392	55	79	9556	821	0	2.780858910895596	-77.6231150268606	474	65	0	39	488	Type II
83	22.88100610296851	-11.039458195711703	154	121	33	83	8467	3230	96	1.5254003596640098	-11.039458195711703	3983	97	96	570	83	Type III
86	262.6582242017031	-16.45000198189058	8096	289	19	86	3953	3291	43	17.510545262394547	-16.45000198189058	951	61	43	39	21	Type II
88	319.39658732747756	-42.33144794795942	7343	125	32	88	4585	6732	75	21.293101039889653	-42.33144794795942	1823	61	75	680	773	Type II
93	202.05979112501865	-33.020868845405886	537	236	18	93	1939	3673	136	13.470648167799785	-33.020868845405886	5399	25	136	241	788	Type I
98	9.150885627874267	-10.162221816139436	5625	486	14	98	1925	3124	81	0.6100575664012837	-10.162221816139436	2806	72	81	413	520	Type III
99	38.8409137175896	-19.811200872813856	2950	191	17	99	956	2408	69	2.589396089660706	-19.811200872813856	4873	32	69	277	716	Type II

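Note that columns sharing the same name are renamed by appending the name of the Data object they come from (e.g. RA becomes RA_cat1 and RA_cat2). You may also check that the match is indeed correct, for instance by comparing Dec_cat1 with Dec_cat2.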
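
What a merge does with the idx and matched values of a match can be sketched as follows, for two tables stored as plain dictionaries of columns. This is a simplified illustration under stated assumptions, not the package's merge(): the helper merge_two is hypothetical, only rows with a match are kept, and columns present in both tables receive the dataset name as a suffix.

```python
def merge_two(base, other, idx, matched, base_name, other_name):
    """base, other: dicts mapping column name -> list of values."""
    shared = set(base) & set(other)                   # column names in both tables
    keep = [i for i, ok in enumerate(matched) if ok]  # base rows with a match

    merged = {}
    for col, values in base.items():
        out = f'{col}_{base_name}' if col in shared else col
        merged[out] = [values[i] for i in keep]
    for col, values in other.items():
        out = f'{col}_{other_name}' if col in shared else col
        # idx[i] is the row of `other` matched to row i of `base`.
        merged[out] = [values[idx[i]] for i in keep]
    return merged

cat_a = {'id': [0, 1, 2], 'RA': [10.0, 20.0, 30.0]}
cat_b = {'id': [2, 0], 'flux': [7.5, 3.2]}
idx, matched = [1, 0, 0], [True, False, True]

merged = merge_two(cat_a, cat_b, idx, matched, 'a', 'b')
print(merged)
# {'id_a': [0, 2], 'RA': [10.0, 30.0], 'id_b': [0, 2], 'flux': [3.2, 7.5]}
```

The row with id 1 has no match and is dropped, which mirrors why the merged table above is shorter than cat1.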

We may now save the merged catalog for later use:

merged_cat.save('samples/output/merged_cat', overwrite=True) # you don't need overwrite=True if file 'samples/output/merged_cat.data' does not exist
# load with:
# merged_cat = Data.load('samples/output/merged_cat.data')

Source of a column#

Sometimes we forget the name of the data from which a column is merged. For example, we are sure that the column named 'A' is merged from somewhere, but cannot recall the name of that dataset. We can use the from_which() method:

merged_cat.from_which('A')

'cat1 (samples/catalog1.csv)'

The column 'A' indeed comes from cat1, which is loaded from "samples/catalog1.csv".

WARNING. The from_which() method has several limitations:

In some cases, the software cannot determine the source of a column, and from_which() returns an empty string.
Direct operations on the data table (astropy.table.Table), especially adding columns computed from other columns, can make from_which() return incorrect results. For example, instead of:

# WRONG:
merged_cat.t['A+i (wrong)'] = merged_cat.t['A'] + merged_cat.t['i'] # adding a new column
merged_cat.from_which('A+i (wrong)') # wrong result

'cat1 (samples/catalog1.csv)'

use:

# CORRECT:
merged_cat['A+i (right)'] = merged_cat['A'] + merged_cat['i'] # adding a new column
merged_cat.from_which('A+i (right)') # right result: this column is added by the user

'user-added (set by user)'

Merging options#

You may want to keep records that cannot be matched to cat3v, or to merge only a subset of the columns from each catalog:

merge_columns = { # specify columns to be merged
    'cat1': ['survey1_id', 'RA', 'Dec'],
    'cat4': ['i', 'j'],
    'cat2': ['survey2_id'],
    'cat3v': ['class1'],
    }

keep_unmatched = ['cat3v'] # keep records that cannot be matched to cat3v

another_merged_cat = cat1.merge(keep_unmatched=keep_unmatched, 
                                merge_columns=merge_columns) # use default outname
print("Name:", another_merged_cat.name)
another_merged_cat.t

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat1
:   cat4
:   :   (cat1)
:   cat2
:   :   cat3v
---------------
[merge] entries with no match for cat3v is kept.
[merge] merged: cat1, cat4, cat2, cat3v
Name: (cat1).MATCH(cat4, cat2, cat3v)

Table length=51

survey1_id	RA	Dec	i	j	survey2_id	class1
int64	float64	float64	int32	int32	int32	str8
0	134.8344427850505	-87.17137328819392	4612	3147	94	Type II
1	342.2571503075698	-32.72306298625976	2740	488	90	Type II
2	263.51781905210584	-61.7079617031306	3568	6944	103	Type II
4	56.16671055927715	-8.319017346651634	5342	1667	122	--
5	56.158027321032954	-67.56369937660125	3192	986	92	--
7	311.82341247897665	-22.000397531125614	5238	2190	101	Type III
9	254.90612800657638	-83.07180811540863	891	5450	113	Type II
10	7.410417946488881	-63.922369237760876	1969	1388	14	--
11	349.16754677831796	-75.4900841471396	6116	6083	53	Type III
...	...	...	...	...	...	...
77	26.65607462427253	-55.193818832951635	281	574	110	Type II
78	129.04766227593814	-5.694301013693888	7029	6887	42	Type III
79	41.7128614290467	-77.6231150268606	9556	821	0	Type II
83	22.88100610296851	-11.039458195711703	8467	3230	96	Type III
86	262.6582242017031	-16.45000198189058	3953	3291	43	Type II
88	319.39658732747756	-42.33144794795942	4585	6732	75	Type II
93	202.05979112501865	-33.020868845405886	1939	3673	136	Type I
96	188.18381857751785	-24.66398890167845	1742	5773	17	--
98	9.150885627874267	-10.162221816139436	1925	3124	81	Type III
99	38.8409137175896	-19.811200872813856	956	2408	69	Type II

You can also specify the columns to be ignored during merging (the ignore_columns argument of merge()); see help(Data.merge) for details.

You may also set the depth for the match_tree and merge methods. Setting depth to 0 keeps only the base catalog itself.

cat4.match_tree(depth=2)

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat4 [base]
:   cat1 [ExactMatcher("survey1_id", "survey1_id")]
:   :   (cat4) [ExactMatcher("survey1_id", "survey1_id")]
:   :   cat2 [<SkyMatcher with thres=1>]
---------------

cat4.match_tree(depth=0)

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat4 [base]
---------------

cat4.merge(depth=1).t

Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
cat4
:   cat1
---------------
[merge] merged: cat4, cat1

Table length=70

id	survey1_id_cat4	i	j	survey1_id_cat1	RA	Dec	A	B
int32	int32	int32	int32	int64	float64	float64	int64	int64
0	78	7029	6887	78	129.04766227593814	-5.694301013693888	7455	126
1	33	6099	3952	33	341.59879341119995	-59.614634573673484	3436	325
2	35	9499	4683	35	291.023045321926	-60.911736118132026	5895	53
3	5	3192	986	5	56.158027321032954	-67.56369937660125	2557	263
4	38	3178	1863	38	246.32388954437647	-57.273335785863544	4893	100
5	41	6052	4440	41	178.26368764005727	-67.33959337571723	5600	52
6	57	2734	4909	57	70.55383047089227	-21.45423462041542	606	307
7	11	6116	6083	11	349.16754677831796	-75.4900841471396	4433	470
8	2	3568	6944	2	263.51781905210584	-61.7079617031306	9637	172
...	...	...	...	...	...	...	...	...
60	93	7519	7232	93	202.05979112501865	-33.020868845405886	537	236
61	0	4612	3147	0	134.8344427850505	-87.17137328819392	1648	312
62	4	9710	4089	4	56.16671055927715	-8.319017346651634	8445	320
63	19	1198	4806	19	104.8424904712951	-41.459198227591436	4911	341
64	7	5238	2190	7	311.82341247897665	-22.000397531125614	98	399
65	42	6089	916	42	12.379867601478622	-45.24763446968531	7996	59
66	75	6257	3986	75	262.4425804947554	-74.30702138955077	1895	116
67	0	7705	2875	0	134.8344427850505	-87.17137328819392	1648	312
68	56	3719	6748	56	31.85730073869102	-29.50780073347093	301	423
69	63	302	5592	63	128.43119760969213	-41.78027843327174	980	157

Things to be noted#

This package does not support matching multiple records to a single record in the base data. For example, table2 below has two records with the same survey_id:

table1 = Data({'survey_id': [0, 1, 2], 'value': ['A', 'B', 'C']}, name='t1')
table1.t

Table length=3

survey_id	value
int64	str1
0	A
1	B
2	C

table2 = Data({'table2_id': [0, 1, 2], 'survey_id': [0, 1, 0]}, name='t2')
table2.t

Table length=3

table2_id	survey_id
int64	int64
0	0
1	1
2	0

If you match table2 to table1 by survey_id using ExactMatcher, the first exact match in table2 will be used:

table1.match(table2, ExactMatcher('survey_id', 'survey_id')).merge().t

[match] "t2" matched to "t1": 2/3 matched.
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
t1
:   t2
---------------
[merge] merged: t1, t2

Table length=2

survey_id_t1	value	table2_id	survey_id_t2
int64	str1	int64	int64
0	A	0	0
1	B	1	1

If you wish to keep the records with the same survey_id in table2, you may match table1 to table2 instead of matching table2 to table1:

table2.match(table1, ExactMatcher('survey_id', 'survey_id')).merge().t

[match] "t1" matched to "t2": 3/3 matched.
Names with parentheses are already matched, thus they are not expanded and will be ignored when merging.
---------------
t2
:   t1
:   :   (t2)
---------------
[merge] merged: t2, t1

Table length=3

table2_id	survey_id_t2	survey_id_t1	value
int64	int64	int64	str1
0	0	0	A
1	1	1	B
2	0	0	A

Or you may merge these records (with the same survey_id) before matching and merging the catalogs.
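
One possible way to combine duplicated records beforehand is sketched below in plain Python: rows sharing the same key are grouped, and the remaining columns are collected into lists. How duplicates should be combined (keep the first, average, collect) depends on the data, so the helper collapse_duplicates is only a hypothetical example of the "collect" variant.

```python
def collapse_duplicates(table, key):
    """Group rows that share `key`; the other columns become lists of values."""
    groups = {}  # key value -> {column: [values]}; dicts keep insertion order
    other_cols = [c for c in table if c != key]
    for row, k in enumerate(table[key]):
        group = groups.setdefault(k, {c: [] for c in other_cols})
        for c in other_cols:
            group[c].append(table[c][row])

    collapsed = {key: list(groups)}  # one row per distinct key value
    for c in other_cols:
        collapsed[c] = [groups[k][c] for k in groups]
    return collapsed

table2 = {'table2_id': [0, 1, 2], 'survey_id': [0, 1, 0]}
print(collapse_duplicates(table2, 'survey_id'))
# {'survey_id': [0, 1], 'table2_id': [[0, 2], [1]]}
```

After this step every survey_id appears once, so matching the collapsed table to table1 no longer silently discards the second record.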
