Travel-time SQL

Description: Creating a database of seismic wave travel-times using SQLite3, for use in seismic tomography. More on GitHub.

Techniques: Python/SQL interaction, database creation, database manipulation

Tools: SQLite3, Python (pandas, sqlalchemy), Jupyter notebooks


The tomographic inversions I performed with the iMUSH project involved a lot of seismic wave travel-times stored in a difficult-to-read format. In this small project I wanted to store them in a form that other people could access in the future, and SQL seemed a good way to go. This post goes through the steps sequentially.

Objective: Each travel-time observation was made for a specific seismic wave traveling from an event (earthquake or explosion) to a recording station. This breaks things into three primary tables that I’ll want to create, with the following columns:

  • Observation: [SourceID, Station, Phase, Uncertainty, FullArrivalTime]
  • Event: [SourceID, PNSNid, Longitude, Latitude, Depth, Longitude-uw, Latitude-uw, Depth-uw, FullEventTime]
  • Station: [Station, Longitude, Latitude, Depth]

The two locations for the Event table come about because I calculated new locations with the data I generated, but wanted to keep the original locations determined by the Pacific Northwest Seismic Network (PNSN).
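As a quick sketch, the target layout can be written out as SQLite DDL. The TEXT/REAL column types here are my assumption; in this project the tables actually get their types from pandas via to_sql() later on:

```python
import sqlite3

# Sketch of the three target tables. The TEXT/REAL types are an assumption;
# the real tables later inherit their types from pandas via to_sql().
con = sqlite3.connect(':memory:')
con.executescript("""
CREATE TABLE observation (
    SourceID TEXT, Station TEXT, Phase TEXT,
    Uncertainty REAL, FullArrivalTime TEXT);
CREATE TABLE event (
    SourceID TEXT, PNSNid TEXT,
    Longitude REAL, Latitude REAL, Depth REAL,
    "Longitude-uw" REAL, "Latitude-uw" REAL, "Depth-uw" REAL,
    FullEventTime TEXT);
CREATE TABLE station (
    Station TEXT, Longitude REAL, Latitude REAL, Depth REAL);
""")
tables = sorted(r[0] for r in
                con.execute("SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)  # ['event', 'observation', 'station']
```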

Existing data: There are five existing text files I’ll want to use to create my desired tables, with the following fields:

  • quakes.obs: [SourceID, Station, Phase, ArrivalTime, Uncertainty]
  • quakes.loc: [SourceID, Longitude, Latitude, Depth, EventTime]
  • combined.sta: [Station, Longitude, Latitude, Depth]
  • mapping file (read as mapfile in the code below): [dbname, orid, SourceID, method, Longitude-ant, Latitude-ant, Depth-ant, FullTime]
  • AntUWreview.txt: [PNSNid, dbname, orid, picker, Latitude-uw, Longitude-uw, Depth-uw]

Creating the SQL database with Python

To start this process, I mostly used Python. On my GitHub page there is a version of these steps (travelTimeDB.ipynb) that constructs the desired fields using Python and creates the SQL database at the end. In this post, I'll go over the steps I took that used SQL (summarized in travelTimeDB_sql.ipynb on GitHub); ideally, at the end the two versions will have identical databases.

Either way, I first read the data files into pandas dataframes, then write these into an SQLite database.

import pandas as pd
from sqlalchemy import create_engine

# define the directory that has the struct3dp data files
# (datadir, along with mapfile, rundir, and modelname below, depends on the local setup)

# These file names are always associated with an inversion
stafile='combined.sta' # the stations that recorded travel times
eqobsfile='quakes.obs' # the travel time observations for earthquakes
eqlocfile='quakes.loc' # the locations of earthquakes

# read into dataframes
sta_df=pd.read_csv(datadir + '/' + stafile, names=['Station','Longitude','Latitude','Depth'], sep=' ', skipinitialspace=True)
obs_eq_df=pd.read_csv(datadir + '/' + eqobsfile, names=['SourceID','Station','Phase','ArrivalTime','Uncertainty'], sep=' ', skipinitialspace=True)
loc_eq_df=pd.read_csv(datadir + '/' + eqlocfile, names=['SourceID','Longitude','Latitude','Depth','EventTime'], sep=' ', skipinitialspace=True)

# read in more files that provide some mapping between other files
mapcols=['dbname','orid','SourceID','method','Longitude-ant','Latitude-ant','Depth-ant','FullTime'] # columns of the mapping file
antUWfile='AntUWreview.txt'
antUWcols=['PNSNid','dbname','orid','picker','Latitude-uw','Longitude-uw','Depth-uw']

datamap_df=pd.read_csv(datadir + '/' + mapfile, names=mapcols, sep=' ', skipinitialspace=True)
antUW_df=pd.read_csv('{}/{}/ANT/{}'.format(rundir,modelname,antUWfile), names=antUWcols, sep=' ', skipinitialspace=True, dtype={'orid': str})

Reading the text files into pandas dataframes is relatively straightforward with pd.read_csv(...); I then write those dataframes to SQLite tables with dfname.to_sql(...):

### add data to sqlite database
dbdir='/Users/ulberg/Documents/GitHub/travelTimeDB/DB' # make sure this directory is created before running
filename=dbdir + '/tt_db_sql.sqlite'

# create engine
engine=create_engine('sqlite:///' + filename)

### create tables in 'raw' state (with '_r'), with all columns
### will work on these in sql to create the final tables
# add station table
sta_df.to_sql('sta_r', con=engine, if_exists='replace', index=False) # convert dataframe to sqlite db, replace old version (DANGEROUS)
obs_eq_df.to_sql('obs_r', con=engine, if_exists='replace', index=False)
loc_eq_df.to_sql('loc_r', con=engine, if_exists='replace', index=False)
datamap_df.to_sql('map_r', con=engine, if_exists='replace', index=False)
antUW_df.to_sql('antUW_r', con=engine, if_exists='replace', index=False)

Explore tables using SQL

Now we get into the SQL part of the project. I mostly used the macOS command-line tool, opening the database with:

sqlite3 tt_db_sql.sqlite

Getting oriented: Here are a few commands I used to check on the number of tables, column names, data values, etc. In the command-line environment there are some useful commands that begin with a period, such as .tables, .header on, and .mode columns; the header and mode commands make the output easier to read.

.tables -- list the tables in the database
antUW_r  loc_r    map_r    obs_r    sta_r 
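If you'd rather stay in Python, the same listing comes from querying the sqlite_master catalog table. A minimal sketch against a throwaway in-memory database (standing in for tt_db_sql.sqlite):

```python
import sqlite3

con = sqlite3.connect(':memory:')  # stand-in for tt_db_sql.sqlite
con.execute("CREATE TABLE sta_r (Station TEXT, Longitude REAL)")
names = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(names)  # ['sta_r']
```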

It’s also useful to look at the first few rows of a table:

SELECT * FROM sta_r LIMIT 5;
Station     Longitude   Latitude    Depth     
----------  ----------  ----------  ----------
MB05        -122.28102  46.62087    -0.641    
MB07        -122.04239  46.62378    -0.878    
MC06        -122.1572   46.55202    -0.77     
MC08        -121.96322  46.55335    -0.653    
MD05        -122.27817  46.52316    -0.718

Or get extra information about the columns of a table:

PRAGMA table_info(sta_r);
cid         name        type        notnull     dflt_value  pk        
----------  ----------  ----------  ----------  ----------  ----------
0           Station     TEXT        0                       0         
1           Longitude   FLOAT       0                       0         
2           Latitude    FLOAT       0                       0         
3           Depth       FLOAT       0                       0  

Create final station table

The sta_r table is already in the final form we want, so we could rename it, or create a new table with the final name 'station':

CREATE TABLE station AS
SELECT * FROM sta_r;

Work on the observation table

To complete the observation table, we'll have to create a FullArrivalTime column, which is equal to the FullEventTime (from map_r) + ArrivalTime (from obs_r, in seconds) – EventTime (from loc_r, in seconds). It would be nice to compute this by joining the tables on the SourceID column, but that can't be done directly because there are duplicate SourceID values in the map_r table.

Show duplicate rows: This query shows some of the duplicated rows:

  SELECT m.* FROM map_r m
  JOIN (SELECT SourceID, COUNT(*) FROM map_r
        GROUP BY SourceID HAVING COUNT(*) > 1)
  USING(SourceID) ORDER BY m.SourceID, m.method ASC LIMIT 6;
dbname      orid        SourceID    method      Longitude-ant  Latitude-ant  Depth-ant   FullTime           
----------  ----------  ----------  ----------  -------------  ------------  ----------  -------------------
2014_Q3     1395        4301395p    ant2s3dp    -122.6557      45.5633       15.5        20140703T132841.314
UW          60058633    4301395p    uw2s3dp     -122.65567     45.56317      15.5        20140703T132841.314
2014_Q3     1404        4301404p    ant2s3dp    -121.9153      46.8328       10.4        20140715T135452.527
UW          60818252    4301404p    uw2s3dp     -121.91517     46.83267      10.44       20140715T135452.527
2014_Q3     1414        4301414p    ant2s3dp    -121.9637      46.9432       12.4        20140725T131355.671
UW          60827037    4301414p    uw2s3dp     -121.96367     46.94317      12.44       20140725T131355.671
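The same duplicates are easy to confirm on the pandas side before writing map_r. A sketch with toy rows mirroring the ant/uw pairs shown above:

```python
import pandas as pd

# Toy rows mirroring the ant/uw pairs shown above
datamap_df = pd.DataFrame({
    'SourceID': ['4301395p', '4301395p', '4301404p', '4301404p'],
    'method':   ['ant2s3dp', 'uw2s3dp',  'ant2s3dp', 'uw2s3dp'],
})
counts = datamap_df.groupby('SourceID').size()
dup_ids = counts[counts > 1].index.tolist()
print(dup_ids)  # ['4301395p', '4301404p']
```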

Join tables: To make things easier, I'll first create a FullEventTime column in a new loc_r1 table. I get around the duplicate-row issue in map_r by using DISTINCT to select only unique pairs of the SourceID and FullTime columns:

  CREATE TABLE loc_r1 AS
  SELECT loc_r.*, m.FullTime FROM loc_r
  JOIN (SELECT DISTINCT SourceID, FullTime FROM map_r ORDER BY SourceID) m
  USING(SourceID);
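The pandas-only notebook (travelTimeDB.ipynb) has to solve the same duplicate problem; one way to mirror SELECT DISTINCT there is drop_duplicates() before merging (whether the notebook does it exactly this way is not shown here). A sketch with toy values:

```python
import pandas as pd

# Toy stand-ins for loc_r and map_r; map_r has a duplicated SourceID
loc = pd.DataFrame({'SourceID': ['a', 'b'], 'EventTime': [26.897, 41.314]})
mapdf = pd.DataFrame({'SourceID': ['a', 'a', 'b'],
                      'FullTime': ['t1', 't1', 't2']})
# drop_duplicates() keeps unique (SourceID, FullTime) pairs, like SELECT DISTINCT
loc1 = loc.merge(mapdf[['SourceID', 'FullTime']].drop_duplicates(),
                 on='SourceID')
print(loc1['FullTime'].tolist())  # ['t1', 't2']
```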

Convert to correct date format: SQLite uses the ISO8601 date text format, which is slightly different from what I have here (yyyy-mm-ddTHH:MM:SS.sss rather than yyyymmddTHHMMSS.sss), so I need to add a few - and : symbols using the substr() function and the concatenation operator ||:

CREATE TABLE loc_r2 AS
SELECT SourceID, Longitude, Latitude, Depth, EventTime, 
substr(FullTime,1,4) || '-' || substr(FullTime,5,2) || '-' || substr(FullTime,7,5) || ':' || substr(FullTime,12,2) || ':' || substr(FullTime,14) 
AS FullEventTime FROM loc_r1;
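The same reshuffling written as Python slices is a quick way to sanity-check the substr() offsets (substr() is 1-indexed while Python slices are 0-indexed):

```python
def to_iso8601(t):
    # '20140703T132841.314' -> '2014-07-03T13:28:41.314'
    # Slices mirror substr(t,1,4), substr(t,5,2), substr(t,7,5),
    # substr(t,12,2), substr(t,14) from the SQL above.
    return f"{t[0:4]}-{t[4:6]}-{t[6:11]}:{t[11:13]}:{t[13:]}"

print(to_iso8601('20140703T132841.314'))  # 2014-07-03T13:28:41.314
```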

Now the loc_r2 table has the FullEventTime as well as the EventTime (in seconds):

SELECT * FROM loc_r2 LIMIT 5;
SourceID    Longitude   Latitude    Depth       EventTime   FullEventTime          
----------  ----------  ----------  ----------  ----------  -----------------------
4301394p    -122.4719   45.8537     0.0         26.897      2014-07-01T22:26:26.897
4301395p    -122.6557   45.5633     15.5        41.314      2014-07-03T13:28:41.314
4301396p    -122.4267   46.1724     0.123       28.559      2014-07-09T18:19:28.559
4301401p    -122.4558   46.3526     1.42        30.893      2014-07-14T17:30:30.893
4301404p    -121.9153   46.8328     10.4        52.527      2014-07-15T13:54:52.527

Join location information to observation: Once the loc_r2 table has the FullEventTime, I can join it to the obs_r table using SourceID as the common key, creating obs_r1, and view the result:

  CREATE TABLE obs_r1 AS
  SELECT obs_r.*, loc_r2.EventTime, loc_r2.FullEventTime FROM obs_r
  JOIN loc_r2 USING(SourceID);

SELECT * FROM obs_r1 LIMIT 5;
SourceID    Station     Phase       ArrivalTime  Uncertainty  EventTime   FullEventTime          
----------  ----------  ----------  -----------  -----------  ----------  -----------------------
4301394p    MO06        P           31.432       0.096        26.897      2014-07-01T22:26:26.897
4301394p    MM02        P           32.135       0.08         26.897      2014-07-01T22:26:26.897
4301394p    MO07        P           32.552       0.081        26.897      2014-07-01T22:26:26.897
4301394p    MN07        P           33.23        0.121        26.897      2014-07-01T22:26:26.897
4301394p    MJ02        P           33.79        0.092        26.897      2014-07-01T22:26:26.897

Make new time information: The FullArrivalTime is equal to FullEventTime + ArrivalTime – EventTime, which I can calculate with the strftime() function, storing the result in obs_r2:

CREATE TABLE obs_r2 AS
SELECT SourceID, Station, Phase, ArrivalTime, Uncertainty, EventTime, FullEventTime, 
strftime('%Y-%m-%dT%H:%M:%f',FullEventTime,'-'||EventTime||' seconds','+'||ArrivalTime||' seconds') 
AS FullArrivalTime FROM obs_r1;
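To sanity-check the arithmetic, the same FullEventTime + ArrivalTime – EventTime calculation can be done with Python's datetime, using the first row of obs_r1 above:

```python
from datetime import datetime, timedelta

# First row of obs_r1 above
full_event = datetime.strptime('2014-07-01T22:26:26.897', '%Y-%m-%dT%H:%M:%S.%f')
event_time, arrival_time = 26.897, 31.432  # seconds
full_arrival = full_event + timedelta(seconds=arrival_time - event_time)
iso = full_arrival.strftime('%Y-%m-%dT%H:%M:%S.%f')[:-3]  # trim to milliseconds
print(iso)  # 2014-07-01T22:26:31.432
```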

Create final observation table

At this point we’re ready to create the final observation table:

CREATE TABLE observation AS
SELECT SourceID,Station,Phase,Uncertainty,FullArrivalTime FROM obs_r2;
SELECT * FROM observation LIMIT 5; -- view the first 5 rows
SourceID    Station     Phase       Uncertainty  FullArrivalTime        
----------  ----------  ----------  -----------  -----------------------
4301394p    MO06        P           0.096        2014-07-01T22:26:31.432
4301394p    MM02        P           0.08         2014-07-01T22:26:32.135
4301394p    MO07        P           0.081        2014-07-01T22:26:32.552
4301394p    MN07        P           0.121        2014-07-01T22:26:33.230
4301394p    MJ02        P           0.092        2014-07-01T22:26:33.790

Make event table

For this, I need to join information from the antUW_r table to the loc_r table, which is complicated by the fact that antUW_r doesn't have a SourceID column:

SELECT * FROM antUW_r LIMIT 5;
PNSNid      dbname      orid        picker      Latitude-uw  Longitude-uw  Depth-uw  
----------  ----------  ----------  ----------  -----------  ------------  ----------
60810532    2014_Q3     2           Carl        46.5397      -122.9978     0.0       
60810642    2014_Q3     1394        Carl        45.8663      -122.4492     0.0       
60058633    2014_Q3     1395        Carl        45.5633      -122.6557     15.9      
60814162    2014_Q3     1396        Carl        46.1728      -122.4225     0.0       
0           2014_Q3     1401        Carl        46.3526      -122.4558     1.4 

This can be remedied by comparing the dbname and orid columns with the map_r table (I'll store the result in an intermediate antUW_r1 table):

  CREATE TABLE antUW_r1 AS
  SELECT a.*, m.SourceID FROM antUW_r a
  LEFT OUTER JOIN map_r m USING(dbname,orid);

Then join the new information to the loc_r2 table, creating loc_r3:

  CREATE TABLE loc_r3 AS
  SELECT l.*, a.PNSNid, a."Longitude-uw", a."Latitude-uw", a."Depth-uw" FROM loc_r2 l
  LEFT OUTER JOIN antUW_r1 a USING(SourceID);
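For comparison, the equivalent two-key join in pandas is a merge on both columns (how='left' mirrors LEFT OUTER JOIN). A sketch with a single made-up row on each side:

```python
import pandas as pd

# Single made-up row from each side of the join
antUW = pd.DataFrame({'PNSNid': ['60810642'], 'dbname': ['2014_Q3'], 'orid': ['1394']})
mapdf = pd.DataFrame({'dbname': ['2014_Q3'], 'orid': ['1394'], 'SourceID': ['4301394p']})
merged = antUW.merge(mapdf, on=['dbname', 'orid'], how='left')
print(merged['SourceID'].tolist())  # ['4301394p']
```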

Create final event table

At this point, just choose the desired columns to keep and create the final table:

  CREATE TABLE event AS
  SELECT SourceID, PNSNid, Longitude, Latitude, Depth, FullEventTime, "Longitude-uw", "Latitude-uw", "Depth-uw" 
  FROM loc_r3;
SELECT * FROM event LIMIT 5;
SourceID    PNSNid      Longitude   Latitude    Depth       FullEventTime            Longitude-uw  Latitude-uw  Depth-uw  
----------  ----------  ----------  ----------  ----------  -----------------------  ------------  -----------  ----------
4301394p    60810642    -122.4719   45.8537     0.0         2014-07-01T22:26:26.897  -122.4492     45.8663      0.0       
4301395p    60058633    -122.6557   45.5633     15.5        2014-07-03T13:28:41.314  -122.6557     45.5633      15.9      
4301396p    60814162    -122.4267   46.1724     0.123       2014-07-09T18:19:28.559  -122.4225     46.1728      0.0       
4301401p    0           -122.4558   46.3526     1.42        2014-07-14T17:30:30.893  -122.4558     46.3526      1.4       
4301404p    60818252    -121.9153   46.8328     10.4        2014-07-15T13:54:52.527  -121.9153     46.8328      11.8 


Now we have our three desired tables: station, event, and observation. These record the travel-time information that I collected along with a few undergraduate volunteers, and they can be put to a number of uses: as input to tomographic inversion programs, for earthquake (re)location, or for comparison with the existing PNSN database.