Use JSON API and capture new bike_list data (#3)
* Add basic test cases

* Add separate request function and associated tests

* Remove each-both from mkplace

* Replace params with .proc.params

* Replace wget with curl

* Remove webpage flag

* Remove .xml.p wrapper

* Use JSON API and update tests to match

* Remove unused columns

* Fix JSON tests

* Add additional comments

* Update README references from xml to json

* Remove xml test file

* Remove unused kdblog flag

* Remove xml with maximum prejudice

* Add correct error logging function

* Update logging

* Add logging test

* Add bikes namespace

* Include notes on log replaying

* TorQify those filepaths

* Changed to .Q.hg. Added bike_list data. Fixed writedown function.

* Removed unnecessary comments.

* Fixed errors in README

* Fixed typo. Minor appearance changes.

* Added stop bikes script. Modified writedown function so it would not override data already in hdb.

* Added stop_bikes.sh to readme.

* Fixed typo. Changed method of passing in rdb port.

* Added error handling to the stop script. Cleaned up writedown method. Updated README.

* Removed bug in stop script that was being used for testing.

Co-authored-by: Thomas Smyth <[email protected]>
jlogan540 and tsmyth-aquaq authored Mar 19, 2020
1 parent d47349a commit e304337
Showing 15 changed files with 233 additions and 139 deletions.
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
*.q linguist-language=q
78 changes: 61 additions & 17 deletions README.md
@@ -1,22 +1,36 @@
# Belfast Bikes

## Parsing XML Data from the Nextbike Platform

Given a semi-public API, what data analysis can we perform? The nextbike platform provides an xml of raw data, detailing bike availability at each station in the cities and countries where the system exists. In particular, we have focused on Belfast (city 238) where BelfastBikes has 43 docking stations available to rent a bike from, although we can modify this to be any city on the platform, using the respective city number.

Using kdb+ we can pull this data periodically off the web and create a time-series database to perform analysis on, made possible by the TorQ framework. Taking advantage of TorQ's capabilities, the process has been enhanced to include error logging, history logging and an extension of the kdb+ built-in timer, making it repeat until a specified end-time. Theoretically this allows us to view log files for any date and rebuild the usage of BelfastBikes for any date where we have collected information.

Once a database has been created, time-series and other queries can be executed against the data. For example, by rebuilding the table for yesterday's bike usage, we can create a mapping of 'routes' that each bike has taken, I.e. where the bike has checked in during those 24 hours and at what time during the day.
## Parsing JSON Data from the Nextbike Platform

Given a semi-public API, what data analysis can we perform? The nextbike platform
provides both an XML and JSON output of raw data, detailing bike availability at
each station in the cities and countries where the system exists. In particular,
we have used the JSON output to focus on Belfast (city 238) where BelfastBikes
has 46 docking stations available to rent a bike from, although we can modify
this to be any city on the platform, using the respective city number.

Using kdb+ we can pull this data periodically off the web and create a time-series
database to perform analysis on, made possible by the TorQ framework. Taking
advantage of TorQ's capabilities, the process has been enhanced to include error
logging, history logging and an extension of the kdb+ built-in timer, making it
repeat until a specified end-time. This allows us to view log files
for any date and rebuild the usage of BelfastBikes for any date where we have
collected information.
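
For a feel of what a single pull looks like in q (outside TorQ), a minimal sketch
is shown below; the endpoint, city number and table extraction mirror the settings
and code introduced in this commit, and `.Q.hg` assumes a kdb+ build with SSL
configured for HTTPS requests.
```
/ minimal sketch of one pull: fetch the live JSON for city 238 and parse it
url:"https://nextbike.net/maps/nextbike-live.json?city=238";
raw:.Q.hg hsym`$url;                                   / HTTP GET, response body as a string
parsed:.j.k raw;                                       / parse the JSON into a q dictionary
places:first[first[parsed`countries]`cities]`places;   / station-level table for the city
```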

Once a database has been created, time-series and other queries can be executed
against the data. For example, by rebuilding the table for yesterday's bike
usage, we can create a mapping of 'routes' that each bike has taken, i.e. where
the bike has checked in during those 24 hours and at what time during the day.

## Requirements


Basic knowledge of the q programming language and Linux commands is assumed.
This project requires the use of KDB+ and the [TorQ framework](https://github.com/AquaQAnalytics/TorQ).

## Getting Started

- These bash commands will give directions on downloading TorQ and our BIKE message package. The BIKE package will be placed on top of the base TorQ package.
- These bash commands will give directions on downloading TorQ and our BIKE message
package. The BIKE package will be placed on top of the base TorQ package.

1. Make a directory to check the git repos out into, and a directory to deploy the system to.

@@ -44,28 +58,49 @@ This project requires the use of KDB+ and the [TorQ framework](https://github.co
You should have a combination of each directory's contents included in the deploy directory:

~/deploy$ ls
appconfig aquaq-torq-brochure.pdf bikes.xml code config docs hdb html lib LICENSE logs mkdocs.yml README.md setenv.sh start_bikes.sh tests torq.q xmllogs
appconfig aquaq-torq-brochure.pdf code config docs hdb html lib LICENSE logs mkdocs.yml README.md setenv.sh start_bikes.sh stop_bikes.sh tests torq.q jsonlogs

## Configuration

You can change the city that you want to collect data for by changing the city number variable in `deploy/setenv.sh`, the default value is 238 (Belfast).
You can change the city that you want to collect data for by changing the city
number variable in `deploy/setenv.sh`; the default value is 238 (Belfast).
The default port is set to 14000; this can also be modified in the `setenv.sh` script.
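
For reference, the relevant defaults set in `setenv.sh` by this commit are:
```
export KDBBASEPORT=14000
export JSONLOG=${TORQHOME}/jsonlogs
export KDBHDB=${TORQHOME}/hdb
export CITYNO=238
```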

## Launching the Process
To launch the process of retrieving data about each BelfastBikes location, run the start_bikes.sh executable in the `deploy` directory
To launch the process of retrieving data about each BelfastBikes location, run
the `start_bikes.sh` executable in the `deploy` directory:
```
~/deploy$ ./start_bikes.sh
```
This launches the bikes.q script wrapped in the TorQ framework.

## Collecting Data
The process will run for 14 day once it starts, collecting data every 30 seconds. During the 14 days, there will be a write down to hdb at 6am every day using the previous day's data and saved by date. Within the bikes.q script there are timer functions for both the collection of data and the writedown which can be modified.

The xmllogs directory will contain previously collected data in its raw XML format for each day, saved as a plain text file.
The process will run for 14 days once it starts, collecting data every 30 seconds.
During the 14 days, there is a writedown to the HDB at 6am every day, using the
previous day's data and saving it by date. Within the `bikes.q` script there are
timer functions for both the collection of data and the writedown, which can be
modified.
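
These schedules are driven by the two timer registrations at the bottom of
`bikes.q`; adjusting the start time, end time or interval there changes the
collection window and the daily writedown:
```
// Repeat for 14 days - every 30 seconds
.timer.repeat[.proc.cp[];.proc.cp[]+14D00:00;0D00:00:30;(.bikes.fullbikedataprotected;`);"belfastbikes"];
// At 6am each day, write down yesterday's data to hdb, and delete the in-memory data from 2 days before
.timer.repeat[(.z.D+1)+06:00:00.000000;.z.d+14;0D01:00:00;(.bikes.eodwritedown;`);"dailyWritedownBikes"];
```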

The jsonlogs directory will contain previously collected data in its raw JSON
format for each day, saved as a plain text file.

## Replaying Logs

In order to replay a log file on disk, the following can be used:
```
.bikes.replayjsonlog 2019.01.01
```
This reads the data into the in-memory table `place`, which can then be written to
disk with:
```
.bikes.writedown 2019.01.01
```
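
The log file replayed for a given date follows the naming convention built by
`.bikes.getjsonlog`; for example, with the default city number and log directory,
the file for 2019.01.01 would be (relative to the deploy directory):
```
jsonlogs/jsonlog_20190101_238.txt
```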

## Example Usage

To query the persisted data in the hdb, we can either load in a specific date partition to a q session, or load the entire database to perform queries across a range of dates. To load in the hdb to a q session we can run the following command:
To query the persisted data in the HDB, we can either load in a specific date
partition to a q session, or load the entire database to perform queries across
a range of dates. To load the HDB into a q session we can run the following command:
```
~/deploy$ q hdb/
```
@@ -82,7 +117,8 @@ data:select uid,name,lat,lng,time by bike_numbers from lj[ungroup s;t]
```
This data table will show which stations a particular bike has visited during that day.

Within a partition of our database, we could find out how many rentals were taken from each docking station during that particular day:
Within a partition of our database, we could find out how many rentals were taken
from each docking station during that particular day:
```
q)`x xdesc (select count i by uid from d:select from (update d: differ uid by bike_numbers from (ungroup select time,uid,bike_numbers from place)) where d,not bike_numbers=0) lj select last name by uid from place
uid | x name
@@ -96,7 +132,8 @@ uid | x name
1257794| 19 "Belfast City Hospital Lisburn Rd"
555520 | 17 "Queens University / Botanic Gardens "
```
Or we could query across a range of dates within the database, for example finding out which docking station was most popular across several days:
Or we could query across a range of dates within the database, for example
finding out which docking station was most popular across several days:
```
q)select max x,Dock:name where x=max x by date from ((select count i by uid,date from d:select from (update d: differ uid by bike_numbers from (ungroup select date,time, uid,bike_numbers from place)) where d,not bike_numbers=0) lj select last name by uid from place)
date | x Dock
@@ -106,3 +143,10 @@ date | x Dock
2017.09.25| 41 "Alfred Street / St Malachy's Church"
```

## Stopping the Process

In order to stop the process, run the `stop_bikes.sh` executable in the `deploy` folder.
```
~/deploy$ ./stop_bikes.sh
```
This will save down data collected during the day and kill the process.
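
Behind the scenes the save-down step is handled by `code/util/intradaybikeswd.q`
(added in this commit and shown further down), which connects to the running rdb
on the port passed in on the command line and triggers the writedown before the
process is killed. Condensed, with error handling omitted, it amounts to:
```
conn:.Q.def[(enlist`conn)!enlist 0Nj;.Q.opt .z.x][`conn];  / rdb port passed on the command line
bikerdb:hopen conn;                                        / open a handle to the rdb
bikerdb".bikes.writedown[.z.d]";                           / write today's data down to the HDB
```
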
5 changes: 4 additions & 1 deletion appconfig/settings/belfastbikes.q
@@ -1 +1,4 @@
.proc.loadprocesscode:1b
.proc.loadprocesscode:1b;

hdbdir:hsym`$getenv`KDBHDB; / location of hdb
webpage:"https://nextbike.net/maps/nextbike-live.json"; / URL for nextbike API
155 changes: 110 additions & 45 deletions code/belfastbikes/bikes.q
@@ -1,54 +1,119 @@
/ Record station info from nextbike API

params:.Q.opt[.z.x]

getbikedata:{
/Retrieve data from website and save to xml file
system[raze"wget -q -O bikes.xml ",params[`webpage],"?city=",params[`cityno]];
/Remove any spaces between quotations in preparation for parsing
l:raze read0`:bikes.xml; pos:o where any (o:where l = " ") within/: 2 cut where "\""=/:l;
l[pos]:"^";l}

logbikedata:{[t;f]
/Open connection to file using current time on request
hdat:hopen hsym`$raze[params[`xmllog]],"/","xmllog_",ssr[string[.z.D];".";""],"_",raze params[`cityno],".txt";
/Write data on single line possibly with time appending each time
hdat string[t]," -- ", f,"\n";
/Close connection to file.
hclose[hdat];
}

parsedata:{[x]
/Use xml.q p function to parse
parsed:.xml.p[x]; parsed}

mkplace:{[parsed]
iplace: update "F"$'lat, "F"$'lng, "I"$'uid, "I"$'number, "I"$'bikes, "I"$'bike_racks, "I"$'free_racks, 0^"I"$'"," vs'bike_numbers, "I"$'place_type, "I"$'rack_locks from delete spot, bike_types, bike from `time xcols select from update time:.z.P from update name:{[x]ssr[x;"^";" "]}'[name] from uj/[enlist each parsed[0;2;0;2;0;2;;1]];
`place insert iplace;}

\d .bikes

hdbdir:@[value;`hdbdir;hsym`$getenv`KDBHDB];
webpage:@[value;`webpage;"https://nextbike.net/maps/nextbike-live.json"];

// Request data from nextbike API
request:{
.lg.o[`bikes;"Requesting data from nextbike for city ",c:raze .proc.params`cityno];
/Retrieve data from website
req:.Q.hg hsym `$webpage,"?city=",c;
.lg.o[`bikes;"Returning data for city ",c];
:req;
};

// Get JSON log file name for date d
getjsonlog:{[d]
:hsym`$raze[.proc.params`jsonlog],"/jsonlog_",(string[d]except"."),"_",raze .proc.params[`cityno],".txt";
};

// Log output of API request to file
logbikedata:{[t;j]
fn:getjsonlog`date$t;
.lg.o[`bikes;"Writing to JSON log: ",f:.os.pth fn];
/Open connection to file using current time on request
hdat:hopen fn;
/Write data on single line with corresponding time
hdat string[t]," -- ",j,"\n";
/Close connection to file.
hclose hdat;
.lg.o[`bikes;"Finished writing to JSON log: ",f];
};

// Replay a JSON log into memory
replayjsonlog:{[d]
if[()~key fn:getjsonlog d;
.lg.o[`bikes;"Could not find log file, exiting early: ",.os.pth fn];
:();
];
.lg.o[`bikes;"Found log file, beginning replay: ",f:.os.pth fn];
/Replay each line of log file in turn
{mkplace . readlogline x}'[read0 fn];
.lg.o[`bikes;"Finished replaying log file: ",f];
};

// Parse line from log file
readlogline:{@[;1;.j.k]@[0 29 33 cut x;0;"P"$]0 2};

// Parse json into in memory table place
mkplace:{[t;parsed]
.lg.o[`bikes;"Starting to parse JSON..."]
/Extract tables from JSON
tab:first[first[parsed`countries]`cities]`places;
bike_tab:update name: (exec raze (count each bike_list) #' enlist each name from tab) from exec raze bike_list from tab;
/Refactor data and extract relevant data
tab:`address`bike_list`spot`bike_types`bike _`time xcols update time:.z.P^t,name:trim name from tab;
bike_tab:`pedelec_battery`battery_pack _ `time`name xcols update time:.z.P^t,name:trim name,number:"I"$number,lock_types:raze lock_types from bike_tab;
/Convert floats to ints where appropriate
tab:@[tab;`uid`number`bikes`bike_racks`free_racks;`int$];
tab:@[tab;`place_type`bike_numbers;"I"$];
bike_tab:@[bike_tab;`bike_type`boardcomputer;`int$];
.lg.o[`bikes;"Finished parsing JSON, adding to in memory tables"];
/Insert data into table in memory
`place insert tab;
.lg.o[`bikes;"Added data to in memory table: place"];
`bike_list insert bike_tab;
.lg.o[`bikes;"Added data to in memory table: bike_list"];
};

// Make request to nextbike API, log to disk and parse into in memory table
fullbikedata:{
/Write messages to out logs as requests are processed
.lg.o[1;"Starting to make requests"];
l:getbikedata[];
.lg.o[1;"Finished request"];
logbikedata[.z.P;l];
.lg.o[1;"Finished logging"];
parsed:parsedata[l];
.lg.o[1;"Finished parsing"];
mkplace[parsed];
.lg.o[1;"Requests complete!"];
}
.lg.o[`bikes;"Request started"];
/Record time of request
rt:.z.P;
/Request data from nextbike API
l:request[];
/Write messages to out logs as requests are processed
logbikedata[rt;l];
/Parse JSON into a table and add to in memory table
mkplace[rt;.j.k l];
.lg.o[`bikes;"Request complete"];
};

fullbikedataprotected:{[] @[fullbikedata;`;{[x] show "Error running fullbikedata",x}]};
fullbikedataprotected:{[]@[fullbikedata;`;{[x].lg.e[`bikes]"Error running fullbikedata: ",x}]};

// Write data to disk for date d
writedown:{[d]
dir:` sv .Q.par[hdbdir;d;`place],`;
bikesdir:` sv .Q.par[hdbdir;d;`bike_list],`;
.lg.o[`bikes;"Writing place data to: ",.os.pth dir];
.lg.o[`bikes;"Writing bike_list data to: ",.os.pth bikesdir];
/Checks if directory already exists. Appends data if it does.
checkdir:hsym `$"hdb/",string .z.d;
cmd:$[`place in key checkdir;(insert);set];
cmd[dir; select from `..place where time.date=(d)];
cmd:$[`bike_list in key checkdir;insert;set];
cmd[bikesdir;select from `..bike_list where time.date=(d)];
};

//Repeat for 14 days - every 30 seconds
.timer.repeat[.proc.cp[];.proc.cp[]+14D00:00;0D00:00:30;(fullbikedataprotected;`);"belfastbikes"]
// Clear data for date d
cleardate:{[d]
delete from `..place where time.date=d;
delete from `..bike_list where time.date=d;
};

//At 6am each day, write down yesterdays data to hdb, and delete the data in memory from 2 days before
writedown:{(hsym `$raze"hdb/",string (.z.d-1),`$"/place/") set select from place where time.date=(.z.d-1);
delete from `place where time.date=(.z.d-2)}
// Write yesterdays data to disk
eodwritedown:{
writedown .z.d-1;
cleardate .z.d-2;
};

@[writedown;;"writedown failed"]
\d .

// Repeat for 14 days - every 30 seconds
.timer.repeat[.proc.cp[];.proc.cp[]+14D00:00;0D00:00:30;(.bikes.fullbikedataprotected;`);"belfastbikes"];

.timer.repeat[(.z.D+1)+06:00:00.000000;.z.d+14;0D01:00:00;(writedown;`);"dailyWritedownBikes"]
// At 6am each day, write down yesterdays data to hdb, and delete the data in memory from 2 days before
.timer.repeat[(.z.D+1)+06:00:00.000000;.z.d+14;0D01:00:00;(.bikes.eodwritedown;`);"dailyWritedownBikes"];
2 changes: 0 additions & 2 deletions code/belfastbikes/order.txt

This file was deleted.

72 changes: 0 additions & 72 deletions code/belfastbikes/xml.q

This file was deleted.

6 changes: 6 additions & 0 deletions code/util/intradaybikeswd.q
@@ -0,0 +1,6 @@
// Gets the rdb port passed in from the command line.
conn:.Q.def[(enlist `conn)!enlist 0Nj;.Q.opt .z.x][`conn];
// Opens a handle to the rdb and calls the writedown function.
bikerdb:@[hopen;conn;{-2 "Cannot perform writedown. Unable to open connection, error: ",x;exit 1;}];
bikerdb".bikes.writedown[.z.d]";
exit 0;
4 changes: 3 additions & 1 deletion setenv.sh
@@ -8,7 +8,9 @@ export KDBLOG=${TORQHOME}/logs
export KDBHTML=${TORQHOME}/html
export KDBLIB=${TORQHOME}/lib
export KDBBASEPORT=14000
export XMLLOG=${TORQHOME}/xmllogs
export KDBHDB=${TORQHOME}/hdb
export KDBTESTS=${TORQHOME}/tests
export JSONLOG=${TORQHOME}/jsonlogs
export CITYNO=238

# sets the base port for a default TorQ installation