SQL Primer: querying a single table
Here is the list of use cases we are going to deal with:
Query Gaia DR2 by source_id
Formatting the output
Cone searches, simple cross-matching
Optimizing queries: WHERE clauses and Indexes
Summary
SQL
SQL (Structured Query Language) is the de-facto standard for querying relational databases such as Postgres, MySQL / MariaDB, Oracle, Microsoft SQL Server and many more. While it is supposed to be standardized, in real life each and every database comes with a bunch of extensions and extra commands. We are currently using PostgreSQL 10.
The single most important command is the SELECT statement. Its basic structure looks like this:
SELECT <columns>
FROM <tables>
[WHERE <conditions>]
[ORDER BY <columns> [ASC | DESC]]
[LIMIT <number>]
; -- This is a comment: SQL statements are finished by a semicolon
FROM, WHERE, ORDER BY and LIMIT are called clauses in SQL lingo. Traditionally they are written in uppercase, but postgres allows you to type them lowercase as well. WHERE, ORDER BY and LIMIT are optional - which is why they are shown in "[]". The sort order can be either ascending or descending, and you are allowed to mix them:
ORDER BY vmag ASC, teff DESC
is perfectly valid and will sort the output by increasing V magnitude and decreasing effective temperature for stars having the same V magnitude.
LIMIT 100 or so is something I highly recommend while testing long-running queries. Otherwise you might wait quite some time for a query to complete, only to find out that it does not give you the result you expected.
Each SQL statement is finished by a semicolon ";". One of the common errors is to forget the semicolon, run the command from the psql shell and then wonder why nothing is happening. Without the semicolon the statement is not finished, and Postgres waits for you to finalize it.
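To make the clauses concrete, here is a minimal sketch putting them all together. It assumes the gaiadr2 table and its phot_g_mean_mag and parallax columns, which we will meet properly in a moment:

```sql
-- The 10 brightest stars that have a measured parallax
SELECT source_id, phot_g_mean_mag, parallax
FROM gaiadr2
WHERE parallax IS NOT NULL
ORDER BY phot_g_mean_mag ASC
LIMIT 10;   -- do not forget the semicolon!
```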
For the rest of this page we will mainly deal with the SELECT statement and its clauses. There are many more clauses than the ones you have seen here; a lot of them you are not likely to ever use. The full documentation is here - be warned: it is not that trivial.
Querying a single table, formatting the output, cone searches
Query Gaia DR2 by source_id
Assume you are interested in a star for which you got a Gaia DR2 source_id from somewhere. Mine happens to be 4781196115469953024 and is a nearby star with 3 planets: one Super-Earth and two Mini-Neptunes - details are in TOI-270. I want to see what Gaia knows about it:
SELECT * -- the asterisk * is an abbreviation for "all columns from all tables"
FROM gaiadr2
WHERE source_id = 4781196115469953024;
It is easy to run from the shell: just start "dbuser" (or "dbdemo") as we did in First Contact, copy and paste the select statement, semicolon included, and hit Enter. The output will be longer than fits on your terminal window and look pretty nasty. To make that a bit more readable, please copy and paste the query into a text file named "toi270.sql". Then run (you can use dbdemo instead of dbuser in all queries)
dbuser -f toi270.sql > output.txt
and inspect output.txt with your favourite editor (hint: it should allow you to inspect long lines; "kwrite output.txt" and hitting F10 will do the trick). The result is a nicely formatted table with column names as headers, and we find it is a rather bright (phot_g_mean_mag = 11.63) and nearby (parallax = 44.46 mas, i.e. 22.5 pc) star.
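If you prefer to stay inside the psql shell, one alternative worth knowing (not needed for this tutorial) is psql's expanded display mode, which prints one column per line instead of one very wide row:

```sql
\x on    -- psql meta-command: expanded display, one column per line
SELECT * FROM gaiadr2 WHERE source_id = 4781196115469953024;
\x off   -- back to the normal table layout
```

Note that \x is a psql feature, not SQL - it will not work from a .sql file run through other tools.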
Assume we want to look up a few other stars as well, like Pi Mensae (source_id = 4623036865373793408) and famous HD 209458 (source_id = 1779546757669063552). This time we just want to know the ID, Gaia magnitude, parallax and distance and we want to sort by magnitude. The query then looks like this:
select source_id,
phot_g_mean_mag as gaiamag, -- rename column
parallax,
1000.0 / parallax as distance -- compute value and name it
from gaiadr2
where source_id in (4781196115469953024, 4623036865373793408, 1779546757669063552)
order by gaiamag;
Cut and paste into "several.sql", then run
dbuser -f several.sql
The output is short enough to fit on your screen and should look like this:
source_id | gaiamag | parallax | distance
---------------------+---------+------------------+------------------
4623036865373793408 | 5.49067 | 54.7052418080388 | 18.2797839283667
1779546757669063552 | 7.50866 | 20.6745057063176 | 48.3687501024232
4781196115469953024 | 11.6306 | 44.4564750048542 | 22.4939111769615
(3 rows)
It certainly looks nice, but is impractical if you want to use your results in other programs. For that, CSV files (Comma Separated Values) come in handy, because almost all programs understand or can import them. We just have to get the output into this format.
Formatting the output
To csv-format the results, we use the COPY statement and nest our select in it. To nest statements into other statements is pretty common for SQL, we will see more of it later on. For now copy and paste this statement into "several_csv.sql":
COPY
(
select source_id,
phot_g_mean_mag as gaiamag, -- rename column
parallax,
1000.0 / parallax as distance -- compute value and name it
from gaiadr2
where source_id in (4781196115469953024, 4623036865373793408, 1779546757669063552)
order by gaiamag
)
TO stdout
WITH csv header delimiter ',';
I typed the keywords of the select statement in lowercase here (legal in Postgres) and the new COPY, TO and WITH in uppercase to make them stand out a bit. Please note:
- the select statement is enclosed in ()
- the copy statement will write to stdout, include a header row and use a comma as separator (if you want to have it in Starbase format, replace the comma by a tab - a real tab, not \t or the like)
- I could have used a named output file like "TO 'several.csv'". The reason I use stdout: this way my queries run from anywhere and I can redirect the output to a file. I do not need to remember to rename the output file each and every time. It is just more practical and smooth for me.
- PITFALL: the semicolon is not at the end of the select, but at the end of the COPY statement, because only after the copy this SQL-statement ends!
Run
dbuser -f several_csv.sql > several.csv
and examine the output file. I encourage you to read the documentation of the COPY statement - don't worry, it is far easier than the select statement. Especially look at the options if you want to format your output differently. And if you ever need to import a csv-file, you can use the "COPY ... FROM" variant. Although simple, the copy command offers quite a bit of flexibility.
To complete the story: the Postgres shell psql has its own "\copy" command which internally uses the SQL COPY statement but allows you to post-process exported and pre-process imported files by running programs over them.
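A minimal sketch of what that can look like - the gzip post-processing is just an illustrative assumption, any shell command works, and note that psql requires the whole \copy command on a single line:

```sql
-- psql only: client-side copy, output piped through a program
\copy (select source_id, parallax from gaiadr2 limit 10) to program 'gzip > sample.csv.gz' with (format csv, header)
```

Unlike the plain SQL COPY, \copy reads and writes files on your machine, not on the database server, which is usually what you want.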
Cone searches, simple cross-matching
Although I am pretty sure that almost always you will have one ID or another from the catalogs in flemingdb (Gaia, TIC, 2MASS, PanSTARRS, ...), it may happen that sometimes you just know the coordinates (RA and Dec). What then?
You could try "SELECT ... FROM ... WHERE (ra = 5.15 and dec = -25.74)" but most likely this will fail, because if the database-value is not exactly equal to the value you got, the record will not match the WHERE clause. One way out is using small enough intervals like so:
select ...
from ...
WHERE ra BETWEEN 5.14 AND 5.16 AND dec BETWEEN -25.75 AND -25.73 -- ra BETWEEN x AND y means: ra >= x AND ra <= y
That will work, but there is a better way. We have added the q3c extension to Postgres, which provides an easy and fast way to do cone searches and more. From the very first query we ran I take ra and dec for TOI-270 (RA 68.4160, Dec -51.9574). This time I want to know if there is anything around it that is bright enough to have created a false planetary signal (by being an eclipsing binary, for example), and this time I am going to use pgAdmin4.
Remember: you get the Query Tool by clicking Tools on the blue pgAdmin line and selecting the first entry (Query tool). Query and result look like this:
For your convenience: here is the cut and paste version of the query (you run it by clicking the lightning symbol or pressing F5):
select 3600 * q3c_dist(ra, dec, 68.4160, -51.9574) as rdist,
source_id,
phot_g_mean_mag as gaiamag,
phot_g_mean_mag - 11.63 as deltamag
from gaiadr2
where q3c_radial_query(ra, dec, 68.4160, -51.9574, 200.0 / 3600)
order by rdist;
I think you might need some explanations for this one:
- important: all coordinates in flemingdb are in decimal degrees
- q3c_dist computes the radial distance on the sky from the coordinate columns (ra, dec) to a reference point (here 68.4160, -51.9574, the approximate coordinates of TOI270). I name the radial distance "rdist" by using "AS rdist" (we have seen that trick already)
- because the coordinates are in degrees the result is in degrees. I want it in arcseconds, so I multiply by 3600
- in the last line of the column list I subtract 11.63, the gaiamag of TOI270, which gives the magnitude difference I name deltamag
- the WHERE clause contains the radial query (q3c_radial_query). The parameters are the same as for q3c_dist, just with an additional search radius at the end. The TESS pixels are big (20 asec), so I query for about 10 pixels or 200 asec search radius
- finally I order by rdist (the radial distance from the first bullet point)
Now let's look at the result. The first row is TOI270 itself, so that is fine. The star in the second row is 17 asec away, so it could end up in the same pixel. But: it is almost 9 magnitudes fainter than TOI270 and thus too faint to fake a planetary signal, even if it is an eclipsing binary. Same thing for the next stars: they are too faint and increasingly far away to cause trouble. Still, just to be sure, I probably would want to have a closer look at the star in the third row.
What is this good for? Now, assume you plan to take a dozen Binospec spectra at MMTO and you want to know if there could be any blending issue with one of them; then running queries like this for a dozen stars is doable. You can write them all in one sql file and run them using the dbuser or dbdemo command. If you get a list of coordinates from a colleague and need to match them against Gaia or any other catalog in flemingdb: this is how it basically works. Constructing the Candidate Target List for TESS I ran 20 million of these queries and then computed the light contamination for each and every star. Now, that would either mean a very long sql file or require some Python programming - I will cover the latter in a later tutorial.
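Such a multi-target file is nothing special: just one complete statement per target, each ending with its own semicolon. A sketch (the second target position is made up purely for illustration):

```sql
-- cones.sql: one cone search per target, run with  dbuser -f cones.sql
select source_id, phot_g_mean_mag,
       3600 * q3c_dist(ra, dec, 68.4160, -51.9574) as rdist
from gaiadr2
where q3c_radial_query(ra, dec, 68.4160, -51.9574, 200.0 / 3600)
order by rdist;

select source_id, phot_g_mean_mag,
       3600 * q3c_dist(ra, dec, 120.0000, -30.0000) as rdist
from gaiadr2
where q3c_radial_query(ra, dec, 120.0000, -30.0000, 200.0 / 3600)
order by rdist;
```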
As an intermezzo please make your way in the tree on the left to
Server -> flemingdb -> Databases -> flemingdb -> Schemas -> gaia2 -> Tables -> gaiadr2 -> Partitions
You should see something like this (I opened the indexes for the first partition "gaiadr2_ra000" as well):
This means: in real database life gaiadr2 is not just one massive table with 1.7 billion stars, but a "partitioned" table with 21 sub-tables. I sliced them by RA, as you might guess from the names - 20 deg by default, 10 deg around the galactic center - and we just queried all 21 of them at the same time simply by querying gaiadr2. For our radial query we did not necessarily need to do so, because we know in which RA slice the star is. RA = 68.4160 means it is in gaiadr2_ra060. If we query just that one slice, like so:
select 3600 * q3c_dist(ra, dec, 68.4160, -51.9574) as rdist,
source_id,
phot_g_mean_mag as gaiamag,
phot_g_mean_mag - 11.63 as deltamag
from gaiadr2_ra060
where q3c_radial_query(ra, dec, 68.4160, -51.9574, 200.0 / 3600)
order by rdist;
it will run a bit faster. For radial queries the effect is not huge, because gaiadr2 is sliced by RA, but for any other query - like when you query for stars in a certain magnitude range and in a region you intend to observe - the difference can be very substantial (more than a factor of 10). In order to understand why, we need to understand a bit better what databases are doing when we query them. Luckily pgAdmin provides a pretty easy way to do so, and with that we move on to the last chapter in this tutorial.
Optimizing queries: WHERE clauses and Indexes
In this chapter we look a bit behind the scenes and learn what we should keep in mind when formulating WHERE clauses, i.e. the conditions the star should match. Luckily, using pgAdmin that is pretty easy to do. We revert back to our first query:
SELECT * -- the asterisk * is an abbreviation for "all columns from all tables"
FROM gaiadr2
WHERE source_id = 4781196115469953024;
Cut and paste into pgAdmin's Query Tool (Tools, Query Tool, upper part), but instead of running it, click on the little hand symbol or press F7 for "Explain". In the result pane a little graphic shows up, and mousing over the upper left symbol you should see something like this (I am cutting the screenshots down to the necessary part from now on; you should be familiar with pgAdmin's layout by now):
On the left you see a column of named symbols: gaiadr2_ra000_source_id_idx, gaiadr2_ra020_source_id_idx, ... all the way down to ra340. As all names end with _idx, we conclude that postgres is going to use the index on source_id for each partition of the gaiadr2 table. That is good; a sequential scan over the full gaiadr2_ra000 partition would take much, much longer.
Mousing over the first symbol we get some more detailed information. The most important are:
- Node type - Index Scan (you do not want to see sequential scan here unless the table is short)
- Startup cost - 0.44
- Total cost - 4.46
- Plan Rows - 1 (seems Postgres already knows that we query for just one star)
Our conclusion was correct: postgres will perform an index scan. The costs are relative numbers, basic rule: the lower, the better. The arrow width is telling us something as well: the smaller, the better. Once you see wide arrows coming from sequential scans it is definitely time to think about the conditions in the where clause again. The last symbol is "Append" which just collects the results from the 21 partitions and appends one after the other in no specific order. Again: because we know our star is in the ra060-partition, we could have queried just this partition and would see only the index scan for it. Modify the query and click the hand symbol again if you like.
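For shell lovers: the same plan can be requested in plain SQL by prefixing the query with the EXPLAIN keyword. The output is text and harder to read than pgAdmin's graphic, but the node types and cost numbers are the same:

```sql
-- text version of the query plan; the query itself is NOT executed
EXPLAIN
SELECT *
FROM gaiadr2
WHERE source_id = 4781196115469953024;
-- in the output, look for "Index Scan using ..._source_id_idx" lines
```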
We know from previous queries that the Gaia magnitude of TOI270 is 11.6306. So let us ask for 0.1 mags around it and let postgres explain again:
As you can see, we are going over the gmag index here, the arrows got significantly wider and the costs went up. "Plan Rows" is an estimate of how many matching rows there are in the ra000 partition. Higher costs and wider arrows are no wonder: we are asking for all stars in Gaia in this magnitude range. Sure there is more than one.
We now change the magnitude range to be beyond the range of Gaia magnitudes. To protect ourselves from accidentally hitting the wrong button, I added a LIMIT 100 at the end. Using the hand symbol again, we get this:
The arrows got even wider, and instead of running an index scan postgres would now run a Seq(uential) Scan over the whole partition (gaiadr2_ra000 etc., no more _idx at the end of the name). The total cost went through the roof, and Plan Rows tells us it would deliver about 26 million rows just from the first partition. This is exactly what you do not want to see! Ever. If you really have to query all of Gaia, there are better ways to do it (using Python scripts, which I will explain in another tutorial).
Important: if you ever start a query you did not mean to start, click the square left of the lightning symbol. This will tell postgres to cancel the query.
Ok, we revert to the original range (11.63 to 11.64) and add the source_id back in, so:
SELECT *
FROM gaiadr2
WHERE phot_g_mean_mag between 11.63 and 11.64 and
source_id = 4781196115469953024;
Without showing a screenshot: we are back to small arrows, index scan and an estimated 1 row for every partition. Postgres figured out that the source_id is the most restrictive condition and automatically optimized its query plan in that way. In general postgres is pretty clever in optimizing query plans, but with complex queries it sometimes needs some help - especially on partitioned tables. But as we know now what to look for - and what we do not want to see - it will be pretty easy to check.
We modify the query again by taking out the source_id, keeping the narrow magnitude range, but using Gaias BP magnitude this time (again: hand-symbol):
We are back to sequential scans of all partitions, crazy high total costs and thus to what we do not want to see. So, what happened here?
The answer is simple: while the Gaia magnitude is indexed, the BP magnitude is not. In other words: postgres does not have an index it could use. This is a typical case where you would want to contact us and ask us to add an index for BP - or send us your query and ask what the heck is going on. There will be cases where you need an index that just does not exist. That is not the norm, but it will happen. Indexing every column in Gaia would make the space we need for Gaia explode. So don't be shy, just ask the experts.
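Before asking, you can check for yourself which indexes a partition already has by querying the pg_indexes system view (standard Postgres; shown here for the first partition, any other partition name works the same way):

```sql
-- list all indexes defined on the first RA partition
select indexname, indexdef
from pg_indexes
where tablename = 'gaiadr2_ra000';
```

If no index mentions the column you filter on, a sequential scan is the likely outcome.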
Finally let us see what happens if we query both magnitude ranges:
Although it is hard to see, the arrows actually got a lot smaller - we have only 3 horizontal symbols, while the previous screenshot had 4, so the previous one was more squeezed. However, looking at the first symbol and mousing over it reveals the truth. We are back to index scans, and the total cost estimate declined drastically. The last 2 rows in the gray box tell us what postgres is going to do. It figured out it has an index for gmag, so it will run an index scan over that and retrieve the resulting rows. Then it will filter those for the magnitude range in BP. The remaining rows from all partitions will be collected and appended to the result list, which then is limited to 100 rows.
In fact LIMIT is a bit more clever than shown here: it will stop the query once it has collected 100 rows. Still, at least one partition may have to deliver 100 or more rows before the query can stop. So in effect the limit query above runs pretty fast - try it out if you like. The second effect is that LIMIT prevents pgAdmin from going crazy receiving and rendering possibly millions of rows.
In the postgres documentation you will find the remark that LIMIT should be used together with an ORDER clause, because otherwise the set of rows you get is unpredictable. While the rows you see surely are unpredictable without an order, we are using LIMIT here to protect us from doing things we really do not want to do. For checking whether a query is syntactically correct and for looking at the query plan, it does not matter at all which 100 rows the query delivers. Conversely, running the above query with an ORDER clause would force postgres to collect all results from all partitions, then order them and deliver just the top 100. This would blow up the runtime significantly - especially on big tables and result sets. So keep in mind: for testing, LIMIT without ORDER is just fine. In fact I often commented out an ORDER clause while testing queries, even if I put it back in for the final run.
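What that looks like in practice - a sketch of a test version of a query, with the ORDER clause commented out and LIMIT as the safety net:

```sql
-- test version: LIMIT protects us, ORDER BY is commented out for speed
select source_id, phot_g_mean_mag
from gaiadr2
where phot_g_mean_mag between 11.63 and 11.64
-- order by phot_g_mean_mag    -- put back in for the final run
limit 100;
```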
There is still one open question: why on Earth do we not have an index on BP (and RP)? Answer: We did not need one. Flemingdb is an offspring of constructing the Tess Input Catalog (TIC). All 3 Gaia magnitudes are part of the TIC tables and we have them indexed there. TIC is part of flemingdb. In fact it links many, many catalogs to each other via the TICID. This avoids a lot of cross-matching by coordinates, because either Gaia or TIC has already done it.
As usual: do not forget to log out of pgAdmin: click on the email address at the right side of the blue pgAdmin line and choose Logout. Ignore any warning about unsaved queries and leave the page.
Summary
- We had a first look at the SELECT statement and learned about its FROM, WHERE, ORDER and LIMIT clause.
- We did some simple computations within a query itself. In fact Postgres supports a variety of Mathematical Functions and Operators. And while we are at it: there is a variety of String Functions and Operators as well.
- We saw how to use LIMIT in order to quickly test queries which would produce a lot of output rows if not limited.
- We ran a cone search and thus know how we can cross-match given coordinates with Gaia DR2 or any other catalog.
- We used pgAdmin to run and interpret EXPLAIN for us. For shell lovers: you can run it in the psql shell, but the output is much harder to interpret. Those who want to have a deeper look into it should read Using EXPLAIN.
- We learned what we want to see using EXPLAIN (index scans, bitmap scans, low total costs) and, more importantly, what we do not want to see (sequential scans on big tables).
- We learned that indexing makes a huge difference in runtime on big tables.
- We saw that Postgres can deal with a mixture of indexed and non-indexed columns in the WHERE clause. Rule of thumb: the most restrictive condition should always have an index. And of course: do not run a query where no column in the condition has an index, unless the table is small (< 1 million rows on flemingdb).
As usual: please send questions and comments to DBAdmin.