Decision on handling JSON fields in download #61

Open
tomschenkjr opened this issue Oct 27, 2015 · 4 comments

Comments

@tomschenkjr
Contributor

This is for discussion purposes...

The program development team is leaning toward using JSON as the default download format. This decision is based on JSON being a more reliable and faster method of download, which was confirmed with the Socrata development team.

However, JSON and CSV differ in an important way. JSON files have three fields that are not available in the CSV and not seen in the web interface:

  • location.latitude
  • location.longitude
  • location.needs_geocoding

The first two items are parsimonious breakouts of the concatenated location column that is generated by Socrata. However, these columns may be highly redundant, as it's common practice to upload "latitude" and "longitude" columns, which are then used to create a "location" column.

If we keep location.latitude and location.longitude, we should convert them to numbers to serve their practical function.
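As a rough illustration of the conversion, here is a minimal sketch in Python (used only for illustration; it is not the package's R code). It assumes the records have already been flattened into dictionaries and that the coordinate values arrive as strings; the function name `coerce_coordinates` is hypothetical.

```python
COORDINATE_FIELDS = ("location.latitude", "location.longitude")


def coerce_coordinates(record):
    """Return a copy of a flattened record with coordinate strings as floats."""
    out = dict(record)
    for key in COORDINATE_FIELDS:
        if key not in out:
            continue  # leave records without the field untouched
        value = out[key]
        # Keep missing values missing instead of failing on float("")
        out[key] = None if value in (None, "") else float(value)
    return out


row = {"id": "1", "location.latitude": "41.8781", "location.longitude": "-87.6298"}
converted = coerce_coordinates(row)
```

Other fields, such as `id` here, are left as they were received.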

The location.needs_geocoding field is less useful. It's an internal flag that Socrata uses to handle geocoding for display on its maps. This is pertinent when the location column is not lat/long, but an address field. I don't see much reason to keep this column except it's the easier thing to do.

Something to keep in mind is the consistency of what one sees in R versus the web browser. A valid SoDA endpoint (e.g., example.com/resource/four-four.csv or example.com/resource/four-four.json) can also be viewed in the browser. In our case, it always chooses JSON which can be inconsistent with the CSV columns. We need to balance this in the discussion.

Open for discussion on the best way to handle these three columns: keep them, drop (some of) them, or use trickier logic to align them to the request (e.g., drop them for CSV requests, keep them otherwise).
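To make the third option concrete, here is a minimal sketch in Python (for illustration only) of aligning the columns to the request: drop the three fields when the URL asks for CSV, keep them otherwise. The helper names and the fallback to JSON when the URL has no extension are assumptions, not decided behavior.

```python
import os
from urllib.parse import urlparse

# The three JSON-only fields discussed above
NESTED_LOCATION_FIELDS = (
    "location.latitude",
    "location.longitude",
    "location.needs_geocoding",
)


def requested_format(url):
    """Infer the requested format from the URL extension (assumed default: json)."""
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    return extension.lstrip(".") or "json"


def align_columns(record, url):
    """Drop the JSON-only location fields when the request was for CSV."""
    if requested_format(url) == "csv":
        return {k: v for k, v in record.items() if k not in NESTED_LOCATION_FIELDS}
    return dict(record)


row = {"id": "1", "location.latitude": "41.88", "location.needs_geocoding": False}
csv_view = align_columns(row, "example.com/resource/four-four.csv")
json_view = align_columns(row, "example.com/resource/four-four.json")
```

The cost of this approach is that the same dataset returns different columns depending on the URL, which would need to be documented.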

/cc @dmpe @geneorama

@dmpe
Contributor

dmpe commented Oct 27, 2015

Just for the record:

> The program development team is leaning toward using JSON as the default download format.

I continue to believe that both JSON and CSV should be defaults, and that the user should choose between them. But yes, for the sake of consensus, I agree to support JSON only and convert .csv to .json.

Others:

> If we keep location.latitude and location.longitude, we should convert them to numbers to serve their practical function.

Yes, I would keep them and convert them to numbers.

> I don't see much reason to keep this column except it's the easier thing to do.

Agree, me too.

> Something to keep in mind is the consistency of what one sees in R versus the web browser.
> In our case, it always chooses JSON which can be inconsistent with the CSV columns.

Easy thing to do. Just document it somewhere.

Finally:
'drop some of them' 👍

@geneorama
Member

I just had to work around this exact issue again, so it's fresh in my mind.

I agree with @dmpe on all points, and would add a little more.

Details about location

In every case where I've seen the location value, it's a duplicate of the Latitude and Longitude columns.

Names like location.latitude are just a side effect of nested data being flattened. In some sense "latitude" is actually the full field name, and location is the name of the container.
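A minimal sketch of how flattening produces those dotted names, written in Python for illustration (this is not the parser the package uses, and the sample record is made up):

```python
def flatten(record, parent="", sep="."):
    """Flatten nested dictionaries, joining container and field names with sep."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into the container; its name becomes the prefix
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


row = {
    "latitude": "41.88",
    "longitude": "-87.63",
    "location": {"latitude": "41.88", "longitude": "-87.63", "needs_geocoding": False},
}
flat = flatten(row)
```

Here `flat` contains both `latitude` and `location.latitude`, which is the duplication described above.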

CSV vs JSON column names

I think it would be fine / best to just use CSV column names, which are just called names in the metadata. If the names are set after the data.frame construction, they can be converted to match the current data.frame output with the function make.names.
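For readers who don't have R at hand, here is a rough Python approximation of what make.names does to column names. It covers only two of its rules (invalid characters become "." and names that don't start with a letter, or with a dot not followed by a digit, get an "X" prefix); reserved words and the `unique` option are not handled, so treat it as a sketch rather than an equivalent. The sample names are made up.

```python
import re


def make_names_approx(names):
    """Approximate R's make.names for a list of column names (partial rules only)."""
    cleaned = []
    for name in names:
        # Replace anything other than letters, digits, dots, and underscores
        candidate = re.sub(r"[^A-Za-z0-9._]", ".", name)
        # Valid names start with a letter, or a dot not followed by a digit
        if not re.match(r"[A-Za-z]|\.(?![0-9])", candidate):
            candidate = "X" + candidate
        cleaned.append(candidate)
    return cleaned


columns = make_names_approx(["Case Number", "2015 total", "Latitude"])
```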

JSON content

As far as I can tell nested data like location is always a duplicate of other fields, so it would be a lot cleaner to avoid nested content, and there'd be no loss of information.

@tomschenkjr
Contributor Author

The location field can take on other forms besides lat/long. By default (when uploading data), Socrata doesn't duplicate the fields. However, we've set our portal to show latitude, longitude, and location together (even though it's duplicative). But duplication cannot be guaranteed.

The location field can also be something besides coordinates. For instance, the location field can be an address (e.g., 123 Main Street). We've avoided that on our portal (it's not useful in most mapping programs), but it happens occasionally. It's probably more frequent on other data portals, depending on how each portal is set up.
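Since the contents can't be assumed, any handling would need to check what a given location value actually holds before treating it as coordinates. A minimal sketch in Python, for illustration only: the function name is hypothetical, and the `human_address` key in the example is an assumption about how an address-only value might look, not something confirmed in this thread.

```python
def location_kind(location):
    """Classify a nested location value as 'coordinates', 'other', or 'missing'."""
    if not location:
        return "missing"
    has_latitude = location.get("latitude") is not None
    has_longitude = location.get("longitude") is not None
    if has_latitude and has_longitude:
        return "coordinates"
    # Anything else, such as an address-only value, cannot be used as lat/long
    return "other"


with_coordinates = location_kind({"latitude": "41.88", "longitude": "-87.63"})
with_address = location_kind({"human_address": "123 Main Street"})  # assumed key
```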

@geneorama
Member

I found a situation where you would not get all the columns in JSON that you would in CSV (if you skipped the "nested" fields as I was proposing).
https://data.edmonton.ca/resource/sy89-z97q
https://data.edmonton.ca/resource/sy89-z97q.csv
https://data.edmonton.ca/resource/sy89-z97q.json

The campaign_website field doesn't need to have two elements (since the second one is always NULL), but it does.
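One possible way to handle cases like this is to drop a flattened column only when it is empty in every row, rather than skipping all nested fields. A minimal sketch in Python, for illustration; the sub-field names `campaign_website.url` and `campaign_website.description` and the sample rows are assumptions, not taken from the Edmonton dataset.

```python
def always_null_columns(rows):
    """Return the names of columns that are missing or None in every row."""
    columns = {key for row in rows for key in row}
    return sorted(c for c in columns if all(row.get(c) is None for row in rows))


def drop_columns(rows, to_drop):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


rows = [
    {"name": "A", "campaign_website.url": "http://example.com/a", "campaign_website.description": None},
    {"name": "B", "campaign_website.url": None, "campaign_website.description": None},
]
empty = always_null_columns(rows)
trimmed = drop_columns(rows, empty)
```

This keeps `campaign_website.url`, so no column that carries data is lost, but it requires scanning the full download before deciding which columns to return.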


3 participants