Importing your stars
Most of you will have one or more lists of stars you are actively working on. If you have a dbuser or personal account you can bring them into FlemingDB, query them for say color or magnitude ranges and - needs be - cross-match them to any table we already have and thus get more information about them. If you have a dbuser account, be aware of the fact, that your stars will end in a table located in the schema "tmp" and thus be accessible for everybody. If you have a personal account, nobody but you and your collaborators will be able to see them - you might want to replace the schema tmp used for the examples here by the name of your personal database schema.
While everybody should read the prerequisite, which method to use depends on the form your data is in:
Prerequisite
Importing CSV files
txt2pg: general import of text-files, mass imports
Importing fits files
Summary
Prerequisite
There are 2 questions you need to answer before you get started:
- What are the columns and their data-types?
- What is the primary key?
Every table should have a primary key, although Postgres is not enforcing it. A primary key
- must be unique
- should be integer or long integer
So Gaias source_id, the ticid or Sloans objid serve fine. If you don't have this sort of ID: no problem, postgres can automatically create one. That will be a counter, starting at 1, which is automatically incremented every time you add a row. Any primary key of this kind will never be used again if it is deleted - and that is how it should be (you can override it though).
Importing CSV files
CSV (Comma Separated Values) is not meant literally here. Tab or any other character as separator will serve as well. Usually strings will have to be quoted by single or double quotes, but we will use a way here, that works even with unquoted strings. If you have your information in this form, importing them is a 2-step process:
- create a table in the schema "tmp"
- import the data
We are going to import a csv-file you can copy to your working directory from here:
cp /home/mmtobs/local/pgtemplates/hvs_survey-1.csv .
The file contains MMT hypervelocity stars, the publication is here. The first few lines look like this:
rahms decdms rv rv_e u g r i u_e g_e r_e i_e
0:01:31.115 26:46:55.06 -363.75 10.97 20.329 19.671 19.920 20.167 0.067 0.021 0.029 0.037
0:02:33.817 -9:57:06.85 -86.80 9.04 19.383 18.578 18.867 19.052 0.034 0.021 0.034 0.024
0:05:28.141 -11:00:10.07 -128.52 6.24 20.324 19.271 19.510 19.703 0.069 0.042 0.021 0.030
The coordinates are in HMS respectively DMS format. Means: for now we will have to treat them as strings or text. We will convert them into decimal degrees during the programming tutorial. The rest are single precision floating point numbers. We do not have an obvious ID for out primary key. Means: we will have to create one. But first we have to create the table.
Log into pgAdmin with your email-address, then go along the tree-view to "flemingdb --> Schemas --> tmp --> Tables" and right-click on Tables. This will open a context-menu which you use to get to "Create --> Table". Your window now should look like this:
My user account on CF-managed machines is mpaegert, so I name the table "hvs_mpaegert". Please follow the same naming convention for your tables:
freetochoose_youraccountname
Next activate the Columns pane and click on the + sign at the right to add one column after the other - with exactly the name they have in your file and in exactly the same sequence as they are stored. Once you are done it should look like this:
The first column is our primary key and thus it is named "pk" - just my convention, you can name it differently. The data type is "bigserial" which means it is a 64 bit integer counter starting at 1. "Serial" would be a 32 bit counter - that would be enough for our table. pk must not be NULL (means: have no value) and of course it is the primary key, so the last two flags are switched on. Because in this example we do not allow stars without coordinates, rahms and decdms are switched to "Not NULL" as well. Click "Save" and your table will be created.
It is quite clumsy to click all the columns definitions together and there is a better way to do it. Open a Query Tool (Menu: Tools --> Query) and cut and paste the following command and change the xxx to your account-name on CF-managed machines:
CREATE TABLE tmp.hvs_xxx_todelete
(
pk bigserial NOT NULL,
rahms character varying(15) NOT NULL,
dechms character varying(15) NOT NULL,
rv real,
rv_e real,
u real,
g real,
r real,
i real,
u_e real,
g_e real,
r_e real,
i_e real,
CONSTRAINT hvs_xxx_todelete_pkey PRIMARY KEY (pk)
)
Execute by clicking the hand symbol and you have created another table. To see it you need to right click on "Tables" and select "Refresh" in the context menu of the tree-view. To delete the table we use the tree-view again. Right click on the table name and from the context menu select "Delete/Drop". After confirming that you want to drop the table, it will be gone.
If you need to rename the table, add columns or change the datatype of a column: Right click the table name and select "Properties" and the bottom of the context menu. Or click the ">" symbol next to the table name, then right click on Columns and select Create to add new columns. If you just want to edit one column: right click on the column name in the tree-view and select "Properties".
Time to get some values in. My table is named "hsv_mpaegert". While it is possible to import (or export) tables via pgAdmin, we will use pgsql and the dbuser command to actually run it. The reason is that pgAdmin imports from a path
/home/mmtobs/local/var/lib/pgadmin/storage/xxx_cfa.harvard.edu/
where xxx is your account name (exports will be at the same place). Usually your files are not there and using the psql shell with its "\copy" command is more flexible. The sql-file we use has only one command (replace 'hvs_xxx' with your table name):
\copy tmp.hvs_xxx (rahms, dechms, rv, rv_e, u, g, r, i, u_e, g_e, r_e, i_e) FROM 'hvs_survey-1.csv' DELIMITER ' ' CSV HEADER ;
Important: if you copy and paste to a file - say "hvs_import.sql" - make sure that there is a single tab-character between the single quotes after DELIMITER. Please note that we did not include the column "pk" in the list. Adding QUOTE '''' (4 single quotes) or QUOTE '"'will deal with quoted strings, but we do not need that here. Running the command with
dbuser -f hvs_import.sql
should result in "COPY 1127". To check if everything worked, right click the table name in the tree-view and select "View/Edit data --> First 100 rows". It should look like this:
I made the rv column a bit wider, to see the signs better.
txt2pg: general import of text-files, mass imports
txt2pg has some advantages over the standard import of csv-files. It is
- able to create the database table for you (but will need some help with the data-types)
- is more flexible for unusually formatted files
- able to import more than one file in one go.
We will use the MMT hypervelocity stars here and import them to a second table which we will later use in the programming tutorial. If you did not set your PATH and PYTHONPATH according to Setup your account, you will not be able to use txt2pg.
In that tutorial, we briefly called txt2pg already and got its help message back. the most important parameters from that message are:
-c CLINE line number with column names (default 1)
-f FLINE line number with format description (default 2)
-s SEP separator [default = any whitespace]
-t TNAME table name
--dbconnect=CONNECT database connection string
--dbschema=DBSCHEMA database schema
--fhelp print help for format codes and addons
We already have a line with the column names and it is the first line, so we can basically check that off. However, txt2pg needs a # sign as first character there and we will add that later. We do not have a line with format descriptions, so we will have to add that. Because tabs count as whitespace, the default for the column separator is ok as well. The connection string we will see in the completed command, the dbschema is tmp. "fhelp" promises to tell us more about the format descriptions, so lets try that:
txt2pg --fhelp
Format help
Each format element consists of a mandatory 1 character format code and
an optional addon. The code s may be followed by a number. Format codes are:
b - boolean
i - integer
l - long integer
f - float
d - double precision
t - timestamp without timezone (like 2019-11-20 18:49:20.123)
s - string as character varying
si - as s but with a maximum length of i characters
xi - fixed length format only: skip i characters
Any other code will be treated as postgres type text.
Addon codes:
p - primary key, implies NOT NULL
n - not null
u - unique not null
Means our hms and dms coordinates are s15 and everything else is f. As separator we use the same thing as for the values, so whitespace is ok. You can either copy your hvs_survey-1.csv to hvs_survey-1.txt or copy the edited table from /home/mmtobs/local/pgtemplates/hvs_survey-1.txt. The first few lines look like this now:
# rahms decdms rv rv_e u g r i u_e g_e r_e i_e
# s15 s15 f f f f f f f f f f
0:01:31.115 26:46:55.06 -363.75 10.97 20.329 19.671 19.920 20.167 0.067 0.021 0.029 0.037
0:02:33.817 -9:57:06.85 -86.80 9.04 19.383 18.578 18.867 19.052 0.034 0.021 0.034 0.024
0:05:28.141 -11:00:10.07 -128.52 6.24 20.324 19.271 19.510 19.703 0.069 0.042 0.021 0.030
Please note the added # as first character in the first two lines. Please note that we need to have the first 2 lines if and only if the table does not yet exist. If the table does exist already, we do not need the format description, because txt2pg already knows the data-types from the database itself. We might need the column names: if you are omitting columns or have them in a different order than in the database table.
We import the whole file again with this command (please adjust you table name behind -t and replace dbuser_password by the real password for dbuser (should be in your ~/.pgpass file):
txt2pg -t hvs_demo_coords --dbconnect="postgres://dbuser:dbuser_password@fleming.cfa.harvard.edu:5432/flemingdb" --dbschema=tmp hvs_survey-1.txt
('processing ', 'hvs_survey-1.txt')
Creating table hvs_demo_coords in schema tmp
CREATE TABLE "tmp"."hvs_demo_coords" (pk bigserial NOT NULL PRIMARY KEY , rahms CHARACTER VARYING(15) , decdms CHARACTER VARYING(15) , rv REAL , rv_e REAL , u REAL , g REAL , r REAL , i REAL , u_e REAL , g_e REAL , r_e REAL , i_e REAL );
(1127, ' lines found, ', 1127, 'lines imported in ', 0.396148681640625, ' seconds')
18:00:46.039115: 1 files in 0.39640164375305176 seconds = 0.39640164375305176 seconds / file
Done
We get a message about th file we process, that txt2pg is creating the table in schema tmp and which CREATE statement it uses. Please note that we get "pk" as primary key in the first column for free. It then tells us that it imported 1127 out of 1127 lines in 0.4 seconds and writes a final message having imported 1 file and how long that took.
If you suspect that we can import more than one file at a time, you are right. They must, however, have the same number of columns in the same order and be formatted identically. If you have a handful of files you want to import, you can do it like so:
txt2pg [options] file1.txt file2.txt file3.txt ...
If you have many files to import - say mystars_ra000.txt to mystars_ra359.txt - you need to create a file storing these filenames. "ls mystars_ra*.txt > filenames.txt" will do just fine. The first and only the first of them - mystars_ra000.txt - needs to have a column and format description as above if you want txt2pg to create the table for you. Adding a logfile the command would then look like
txt2pg [other_options] --ifnames=filenames.txt --lfname=myimportlog.txt
txt2pg will import the files and write an ongoing log to myimportlog.txt. This logfile is not buffered, so that you can always look how far it got by running a tail on it.
If you want to import only a limited number of columns out of the 360 files you got, you can use "--ignore=skip_these_columns.txt". The textfile is a space separated list of column numbers to ignore.
If you have a unusual column separator - say the pipe symbol "|" - use "-s" to tell txt2pg about it. Depending on the symbol you might need to escape it for keeping the Linux shell understanding it as a command to itself.
txt2pg is already pretty powerful - many of the tables in schema "catalogs" got imported this way. Behind the scene it is a Python script, so we can further adapt it if needs be.
Importing fits files
Many astronomical catalogs can be downloaded as fits-files only (fits tables). There are 2 ways to get them into your own database table:
- fits2pg.py, an importer much like txt2pg we talked about above
- convert it to a csv-file using fv
We will talk about both ways, but fits2pg is the more convenient one.
fits2pg: the fits-file importer
In Querying a single table we used the triple exoplanet system TOI-270 as an example. The light-curve for this and other system are downloadable from MAST. Conveniently I downloaded the light-curve for this system already and to to grab them run
cp -r /home/mmtobs/wpthree/fldbtutorials/tesslc/ .
This will create a tesslc sub-directory with 3 fits-files in it; "cd tesslc" will get you there and we are going to import them all. The first one directly, the remaining2 by using a trick for importing many fits-files at once (and into the same table). To inspect one of them run
fv tess2018263035959-s0003-0000000259377017-0123-s_lc.fits
(fv is /home/mmtobs/local/fv5.4/fv in case the above should not work, then you do not have /home/mmtobs/local/bin in your PATH) and inspect the table and its header.
"fits2pg -h" will provide us with a list of parameters we can use - for now we only need these:
--commit commit changes to database
--cpheader=CPHEADER filename with definition of information to copy from
fits header (default = None)
--dbconnect=DBCONNECT
database connection string: postgres://exampleuser:exa
mplepassword@examplehost:5432/exampledatabase
--dbschema=DBSCHEMA name of the url to use as database connection string
(default = tmp)
--lfname=LFNAME log file
-t TNAME table name
Dbconnect we already met for txt2pg, we will use the same. "-t" will be "tesslc"; please use a different name for your table, because tesslc already exists now and we will use this table in the plotting tutorial. "cpheader" is new. Now, it is common to store a lot of information in fits-headers we might want to have in the table as well. In the case of TESS light-curves the TICID we need to connect the light-curve information to TIC8 (TESS Input Catalog version 8, part of flemingdb) is stored in the the fits-header, not in the table. We need to tell fits2pg to add the TICID to every record it imports, because we want to know for which star it is. For this we create a file with a special format which for each entry looks like this (cpheader.txt is already in your tesslc directory):
{'TICID' : {'colname' : 'ticid',
'colidx' : 0,
'coltype' : 'bigint'
}
}
Basically these are nested key-value pairs enclosed in {}. We need a starting { and closing } as first and last character of the file. 'TICID' is the name of the header card we need. After the ":" comes the information for this card:
'colname' is a fixed string followed by a : and the name we want to give it in our table ('ticid')
'colidx' is a fixed string followed by a : and the index of the column we want to use (start with 0, then 1 etc)
'coltype' is a fixed string followed by a : and the postgres data-type (here 'bigint')
Please note the "," and the end of the first 2 lines - they separate the inner key : value pairs, all of them enclosed in the inner {}.
Last thing we need is "--commit", but why do we need it? Creating tables from textfiles we had full control over the data-types via the format line. Importing fits-files the program will convert the fits data-type to a postgres data-type. If we do not (!) set commit, fits2pg will only create the table but not actually import the data. This allows us to check and - if needed - adjust the postgres data-type. For reductions like "double precision --> real" or "bigint --> integer" this only can work if the table is empty. Once we are happy with the data-types, we can just run the same command again - this time with active commit - and fits2pg will take the data-type from the already existing postgres table (just like txt2pg does).
Feel free to run it in these 2 steps and check the results via pgAdmin4. Because I already know that the data-types are fine, we can run it in one step in this tutorial:
fits2pg --commit --lfname=firstimport.txt --cpheader=cpheader.txt --dbschema=tmp -t tesslc --dbconnect="postgres://dbuser:dbuser_password@fleming.cfa.harvard.edu:5432/flemingdb" tess2018263035959-s0003-0000000259377017-0123-s_lc.fits
The resulting logfile "firstimport.txt" contains the command for creating the table, the insert command and the report from actually importing the file. Looking at it via pgAdmin4 you will see this:
Please note that by default we get a "pk" as auto-incremented primary key in the first column (us --nopk if you do not want it). pk is followed by ticid, so the information we copied from the header. As we imported just one star, it is the same for all rows - usually we would import all of our light-curves into one table (which may be partitioned if it is > 400 million rows or so). "time", "timecorr" etc are imported from the fits-table.
We have 2 more tables to import and could just rerun the same command for the remaining files. However, catalogs sometimes come as hundreds to tenthousands of fits-files and we do not want to run the same command that many times.
A simple "ls *.fits > lightcurves.txt" will create us a file holding a list of all the filenames. There already is one in your tesslc directory which I edited a bit so that it looks like this:
# lightcurve for TOI 270.x
# tess2018263035959-s0003-0000000259377017-0123-s_lc.fits
tess2018292075959-s0004-0000000259377017-0124-s_lc.fits
tess2018319095959-s0005-0000000259377017-0125-s_lc.fits
Every line starting with a "#" will be ignored by fits2pg (same thing for empty lines). The important thing here is to comment out the first fit-file because we already got that into the database. Knowing that every files has about 20000 rows, we increase the block length to 25000, so that we can get it in one go (--block=25000). That said and modifying the logfile as well we run
fits2pg --commit --lfname=multiimport.txt --cpheader=cpheader.txt --dbschema=tmp -t tesslc --dbconnect="postgres://dbuser:dbuser_password@fleming.cfa.harvard.edu:5432/flemingdb" --ifnames=lightcurves.txt --block=25000
The last line in "multiimport.txt" tells us that we got another 37628 rows imported, in my case with an average of 6300 rows/s. Of course after importing the light-curve we would like to have a look at it. This is done using dbplot.
Convert fits to csv using fv
This section is obsolete by fits2pg, but kept for historic reason. The way described here is bumpy, but workable.
Another way to import fits-files with existing tools is to convert them to csv-files and import them as such. Either directly or using txt2pg. NASAs fits file editor is able to do so (fv). Unfortunately the default fv on CF-managed machines is broken (export does not work properly). We installed a working version on fleming, in order to use it create or edit the file ".alias" in your home directory and add this line (bash users will need a "=" after fv):
alias fv54 "/home/mmtobs/local/fv5.4/fv"
Run "source .alias" to load the definition. Copy the file /home/mmtobs/local/pgtemplates/gaiatest.fits to your current directory, run "fv54" and open the copied file. Click on All for HDU 1 and in the table window on File --> Export as text and click on Save. In the popup-dialog make sure that "User-defined separator" is marked. The default is " " (blank) which we have to change into ",' here because the first column (gaia_dr2_source.designation) does contain blanks itself. We can not have that and use blank as column separator at the same time. Note: you might try to use fv's csv-format, but that will quote every column and thus mark them as string. This is not what we want.
Finally click Save and load the new file into your favorite editor to verify that it is ok. If you want to use txt2pg and do not have a table already, you would have to add the column and format description.
This is certainly a bumpy ride, but doable if for whatever reason you do not want to use fits2pg.
Summary
- We created our first own table - either in tmp or in your personal schema. We did so via pgAdmin and say how to do it via a CREATE TABLE statement. The statement either runs via the Query Tool in pgAdmin or via dbuser.
- We imported the MMT Hypervelocity stars via the psql command \COPY.
- We imported them yet again using txt2pg (text to postgres), behind the scene a Python script developed for TIC.
- We learned that txt2pg can create tables automatically, but needs the column names in line 1 and a format description in line 2 of the first textfile to import.
- We noted that txt2pg can import several to arbitrary many files in one go.
- We introduced the fits-file importer fits2pg which works much like txt2pg, just that it takes the column definitions and data-types from the fits-file and in addition offers a way to copy information from the fits-header into the postgres table.
- We saw that the standard tool fv can be used to convert fits-files to a csv-format which can then be imported. This way is obsolete by fits2pg.
As usual: please send questions and comments to DBAdmin.







