sources data

The "sources" table does a lot. It links the datasets to the files they get data from and to the tables therefrom. There are operational flags as well that guide the updating of datasets and the display in the HCAI web app.

The "file_name" column has a unique constraint on it because tables do need to have unique names in the database. Note that this may not always be true. For example, I could decide to split up the database into more than one. But then I would need to database name field in addition to the table_name column and the combination of the two columns would have a unique constraint. The value is just the leaf name of the file, not any directory that it is in.

There is a unique constraint on the set of columns: "ds_pk", "file_name", and "sheet_name".
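
To make this concrete, here is a sketch of what the relevant part of the table definition might look like. The column types are my guesses for illustration, not the actual schema:

CREATE TABLE sources (
    ds_pk      INT NOT NULL,
    file_name  VARCHAR(255),
    sheet_name VARCHAR(255),
    table_name VARCHAR(64) UNIQUE,  -- table names must be unique in the database
    auto_run   TINYINT,
    shared     TINYINT,
    UNIQUE KEY (ds_pk, file_name, sheet_name)
);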

The "auto_run" column tells the update_one.sh script whether there is any table data in this source file. If the file is an ".xlsx" file or a ".csv" file, then the "update_one.sh" script will pick up this files using the csv_import/helper and excel_import/helper scripts, which can run on many of the data files.

I cannot yet deal with a few file types, even though some of them might contain significant data. There are ".xls" files, which the library I am using does not handle. Also, I am not doing anything with PDF files at this point. A few datasets seem to use ".docx" files for something significant, though often these files just contain random explanations and/or whining from the filers.

The "shared" column tells the update_one.sh script that this data table should be copied up to the server. This will be done nostly when there is an app for this data.

There are a few odd things about these last two columns, and sometimes about the file_name values. Say that two files are turned into tables, and the data in the two tables is then combined into one table. Then you will see a source row for the combined table that has its table name but no file_name value. This combined table will probably have its shared value set to 1, because if the data is important enough to massage, it is probably important enough to display. The first two tables will probably have auto_run set to 1 because we want the usual helper script to process them. But the combined table will have its auto_run value set to 0. This is because the combining operation will be done by an exec_special_after.sh script, and none of the automatic processing will need to know about this table. We will see this, for example, when the data from a "something_2022.csv" file and a "something_2023.csv" file are merged into a "something" table.
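
For that example, the three source rows might look like this. The ds_pk and table_name values here are illustrative, not from the real database:

-- the two yearly files: processed by the usual helper scripts, not shared
INSERT INTO sources (ds_pk, file_name, table_name, auto_run, shared)
VALUES (101, 'something_2022.csv', 'something_2022', 1, 0),
       (101, 'something_2023.csv', 'something_2023', 1, 0);

-- the combined table: built by an exec_special_after.sh script,
-- shared but skipped by the automatic processing
INSERT INTO sources (ds_pk, table_name, auto_run, shared)
VALUES (101, 'something', 0, 1);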

I have set the "auto_run" and "shared" columns so that their default value is 0.
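
If the columns had been created without defaults, something like this would set them up; the exact column types are an assumption:

ALTER TABLE sources
    MODIFY auto_run TINYINT NOT NULL DEFAULT 0,
    MODIFY shared   TINYINT NOT NULL DEFAULT 0;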

A filename may be used by different datasets, so we cannot rely on a file_name alone being unique. For example, a lot of datasets have a ".csv" file, and there is nothing wrong with that. But the filename within a dataset will be unique, unless we use it more than once so that we have a different entry for each sheet, as we do for ".xlsx" files.

But right now I see:

mysql> select ds_pk, file_name, sheet_name from sources where ds_pk = 304 order by file_name;
+-------+-----------------------------------------------+------------+
| ds_pk | file_name                                     | sheet_name |
+-------+-----------------------------------------------+------------+
|   304 | .csv                                          | NULL       |
|   304 | .csv                                          | NULL       |
|   304 | arcgis-geoservice.arcgis-geoservices-rest-api | NULL       |
|   304 | arcgis-hub-dataset.html                       | NULL       |
|   304 | arcgis-hub-dataset0.html                      | NULL       |
|   304 | datapackage.json                              | NULL       |
|   304 | excel.xlsx                                    | NULL       |
|   304 | feature-collection.txt                        | NULL       |
|   304 | geopackage.gpkg                               | NULL       |
|   304 | sqlite-geodatabase.gdb                        | NULL       |
+-------+-----------------------------------------------+------------+
10 rows in set (0.00 sec)

This is possible because, as far as the constraint is concerned, each "NULL" is different from any other "NULL".

If I change these "NULL" values to placeholder strings surrounded by double underbars, then the constraint will complain when there is a duplicate. Just to be pedantic here: a NULL is not equal to any other thing, not even another NULL. See:

mysql> create table testing (one int, two int, three int, four int);
Query OK, 0 rows affected (0.07 sec)

mysql> insert into testing values (1, 1, NULL, NULL);
Query OK, 1 row affected (0.01 sec)

mysql> select * from testing where one = two;
+------+------+-------+------+
| one  | two  | three | four |
+------+------+-------+------+
|    1 |    1 |  NULL | NULL |
+------+------+-------+------+
1 row in set (0.00 sec)

mysql> select * from testing where two = three;
Empty set (0.00 sec)

mysql> select * from testing where three = four;
Empty set (0.00 sec)
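
The same rule carries over to unique keys: NULLs never collide with each other, but a literal placeholder string does. A quick demonstration with a throwaway table:

CREATE TABLE null_demo (name VARCHAR(32), UNIQUE KEY (name));
INSERT INTO null_demo VALUES (NULL);        -- ok
INSERT INTO null_demo VALUES (NULL);        -- also ok: two NULLs do not conflict
INSERT INTO null_demo VALUES ('__NULL__');  -- ok
INSERT INTO null_demo VALUES ('__NULL__');  -- fails with a duplicate-key error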