************************
******* ******
*** kplus ***
******* ******
************************
27.2.2015. 12:00
CAPTAIN'S LOG: FIRST COMMIT:
The parsing is to be done from here:
http://online.konzum.hr/#!/categories/60004323/hrana?show=all&sort_field=name&sort=nameAsc&max_price=22290&page=1&per_page=5430
A very important feature is that the "per_page" can be freely changed.
The XHR requests carry all the information needed.
They seem to have some sort of protection against foreign requests.
The only protection so far seems to be the "time" key provided with the
link, which I have yet to figure out.
For instance, http://online.konzum.hr/v2/categories?time=1425038753938
tells us which categories and subcategories exist.
Potential defense mechanisms:
_ws-rails_session_id
Phusion Passenger 4.0.41
this "time" which is somehow extracted
X-Auth-Token probably from "Phusion Passenger"
"WEBSHOP_COOKIE_online.konzum.hr"
"konzum_hr_wsm_auth_token"
-------------------------------------
27.2.2015. 22:26
UPDATE:
Well, as it turns out, it was a bit easier than expected.
First of all, some useful reads:
http://en.wikipedia.org/wiki/XMLHttpRequest
http://www.w3.org/TR/cors/
http://axilis.com/ <- creators of the website
Fun fact: the creators of the latest Konzum website took part
in a project that parsed their older website.
http://www.html5rocks.com/en/tutorials/file/xhr2/
http://en.wikipedia.org/wiki/Cross-site_request_forgery
https://www.linkedin.com/profile/view?id=130331247
All in all, there were no authentication issues.
No authentication was necessary.
My guess is that the XHR request can't simply be made from the browser
but has to be run as a script.
All my "cookies" and "user-agent" fields were rejected when
making a request.
The main thing was to find a node.js XHR library, and the "xhr2" lib
does just that.
After that and some googling I found a code example and voilà,
I have 1200 Konzum items in my file with all the info I'd need.
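For the record, a minimal sketch of such a request; that the "time"
parameter can simply be the current millisecond timestamp (Date.now())
is an assumption on my part:

var XMLHttpRequest = require('xhr2');  // browser-style XHR for node.js

var request = new XMLHttpRequest();
// Assumption: "time" is just a millisecond timestamp, so Date.now() should do
var url = 'http://online.konzum.hr/v2/categories?time=' + Date.now();

request.open('GET', url, true);
request.setRequestHeader('Accept', 'application/json, text/plain');
request.onreadystatechange = function () {
  if (request.readyState === 4) {
    // Response is JSON listing the categories and subcategories
    console.log(JSON.parse(request.response));
  }
};
request.send();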
---------------------------------------
26.3.2015. 9:48
UPDATE:
Can't believe it's already been a month.
Well, after trying to get things working in Haskell,
I must admit that I've given up.
The issue is that the JSON I have is huge and contains a lot of reserved
keywords, so I'd have to put a lot more work into getting the parser
running, and even then there's no guarantee that Haskell is the right
choice for this.
My next step is to get this working with MongoDB.
The concept:
1. A script runs periodically (every 24h?) and scrapes Konzum
for updated prices.
2. A second script imports these files into MongoDB.
After that I would have to figure out what I actually
want to do with the data.
I suppose some cool graphs in d3js would be nice, I've wanted to do
something with this for quite a while.
I've made some bad choices regarding my server, but I suppose this is
how hacks happen: when you are lazy.
I suppose my laziness comes from inexperience. If I were to do things
securely it would take me far too long,
and I already have a ton of things to figure out. I have to pick up some
security habits ASAP.
The scraping script so far:
var XMLHttpRequest = require('xhr2');  // browser-style XHR for node.js
var fs = require('fs');
// Timestamp for the dump filename (getDate() = day of month; getMonth() is 0-based)
var cd = new Date();
var datetime = cd.getDate() + "_" + cd.getMonth() + "_" + cd.getFullYear()
    + "-" + cd.getHours() + "_" + cd.getMinutes() + "_" + cd.getSeconds();
var request = new XMLHttpRequest();
var path = "http://online.konzum.hr/v2/categories/60006861/products?filter%5Bshow%5D=all&filter%5Bsubcategory_id%5D=&filter%5Bsort_field%5D=name&filter%5Bsort_type%5D=asc&filter%5Bprice%5D%5Bmin%" +
    "5D=0&filter%5Bprice%5D%5Bmax%5D=110&filter%5Bsort%5D=nameAsc&per_page=1&page=1&time=1427296297983";
request.open("GET", path, true);
request.setRequestHeader('Accept', 'application/json, text/plain');
// Write the dump once the response has fully arrived
request.onreadystatechange = function () {
    if (request.readyState === 4) {
        fs.writeFileSync('konzum_' + datetime + '.dump', request.response);
    }
};
request.send();
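That covers step 1 of the concept above. Step 2, the MongoDB import,
isn't written yet; a rough sketch of what it could look like, where the
collection name, dump filename and dump layout are all assumptions:

var fs = require('fs');
var MongoClient = require('mongodb').MongoClient;

// Hypothetical dump layout: an array of products under a "products" key
var products = JSON.parse(fs.readFileSync('konzum_26_3_2015.dump', 'utf8')).products;

MongoClient.connect('mongodb://localhost:27017/kplus', function (err, db) {
  if (err) throw err;
  // The "products" collection name is made up for this sketch
  db.collection('products').insert(products, function (err) {
    if (err) throw err;
    db.close();
  });
});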
26.3.2015. 10:50
UPDATE:
So, the script is running.
I've set it to be saved in /home/.
crontab is used to periodically run scripts; crontab -e opens the crontab file.
0 8,20 * * * nodejs /home/kparse.js
runs the script every day at 8am and 8pm.
I wish I had done this sooner to have more interesting data.
I suppose I should back this up as well.
Maybe use crontab to mail it? hehe
OK, the script is not finished yet.
I will make it fetch all the items from all the categories
once the category list has been parsed.
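A rough sketch of that loop; the categories dump filename and the "id"
field are assumptions until I check against the real dump:

var XMLHttpRequest = require('xhr2');
var fs = require('fs');

// Assumption: the categories dump is a JSON array of objects with an "id" field
var categories = JSON.parse(fs.readFileSync('/home/konzum_categories.dump', 'utf8'));

categories.forEach(function (category) {
  var request = new XMLHttpRequest();
  var url = 'http://online.konzum.hr/v2/categories/' + category.id +
            '/products?per_page=5430&page=1&time=' + Date.now();
  request.open('GET', url, true);
  request.setRequestHeader('Accept', 'application/json, text/plain');
  request.onreadystatechange = function () {
    if (request.readyState === 4) {
      // One dump file per category
      fs.writeFileSync('/home/konzum_' + category.id + '.dump', request.response);
    }
  };
  request.send();
});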
29.3.2015. 20:40
UPDATE:
Man, time flies by.
Ok, the script is up and running.
It runs every day at 7am and each run takes up about 20 MB.
I hope I'll get some useful information by the end of this.
3.5.2015. 00:42
UPDATE:
The script is still running regularly without any issues whatsoever.
I have also set up a MySQL database on the server, which now holds
all the entries collected so far (300k+ rows).
I had some issues with uploading all these entries.
I've used node.js and its mysql module, which ROCKS!!
At first things were slow over a single connection (5h for 100k rows),
but after editing only 2 lines and adding 99 more connections
the whole database upload takes around 15 minutes.
Also, there was an issue with the character set, so after changing it to
utf8 to support Croatian characters (č, ć, š, ž) I had to reupload everything.
I've created a composite primary key (id, datum), which seems useful.
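A minimal sketch of the pooled upload, so I don't forget how it worked.
The table name, the columns other than (id, datum), the dump layout and
the credentials are all made up for illustration; only the ~100-connection
pool, the utf8 charset and the (id, datum) key reflect the actual setup:

var mysql = require('mysql');
var fs = require('fs');

// ~100 connections instead of a single one is what cut the upload time
var pool = mysql.createPool({
  connectionLimit: 100,
  host: 'localhost',
  user: 'kplus',               // hypothetical credentials
  password: 'secret',
  database: 'kplus',
  charset: 'utf8_general_ci'   // so Croatian characters survive the upload
});

// Hypothetical dump layout: an array of products under "products"
var dump = JSON.parse(fs.readFileSync('konzum_26_3_2015.dump', 'utf8'));
var datum = '2015-03-26';      // date of the dump, part of the primary key

dump.products.forEach(function (p) {
  pool.query(
    'INSERT INTO prices (id, datum, name, price) VALUES (?, ?, ?, ?)',
    [p.id, datum, p.name, p.price],
    function (err) { if (err) console.error(err); }
  );
});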
Before the database I had some issues with making charts.
I've tried the bokeh Python lib and the chartist/chart.js libs for JS,
but the biggest problem was crunching the huge files.
It should be much better now that the database is up and running smoothly.
Also, the mongoose app has been a huge help.
Basically, it sets up a localhost server in the folder it is run from,
and then you can make requests to it, which really helped me get around
the "local file reading disabled" restriction in browser JS.
Now that I think of it, I probably only needed it for browser scripts,
since the locally run node.js script could have used node's fs module,
but oh well, good to know it exists.
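For reference, what pulling a dump into a browser chart script through
that local server looks like; that mongoose serves the dump folder on
port 8080 and the exact filename are assumptions:

// Browser-side: fetch a dump over HTTP instead of reading it from disk,
// which is what the "local file reading disabled" restriction forbids
var request = new XMLHttpRequest();
request.open('GET', 'http://localhost:8080/konzum_26_3_2015.dump', true);
request.onreadystatechange = function () {
  if (request.readyState === 4) {
    var data = JSON.parse(request.response);
    console.log(data);  // feed this to chartist / chart.js from here
  }
};
request.send();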
19.5.2015. 15:11
UPDATE:
Time to start wrapping things up.
A useful regex (matches a whole line containing "ezonski", i.e. the "Sezonski ..." entries):
\n.+ezonski.+\n
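A sketch of applying it to a dump before parsing; that the matching lines
are meant to be dropped (rather than just located in an editor) is my
assumption, and the filenames are placeholders:

var fs = require('fs');

// Remove every line containing "ezonski", keeping a single newline in its place
var raw = fs.readFileSync('konzum_26_3_2015.dump', 'utf8');
var cleaned = raw.replace(/\n.+ezonski.+\n/g, '\n');
fs.writeFileSync('konzum_26_3_2015.cleaned.dump', cleaned);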
I have had some issues while reuploading.
Not sure how, but mistakes crept in while copying or downloading the files
and some crucial JSON elements got broken, so the parser does not work.
Anyhow, the following things in these files have been changed
(in case I have to fix them again on the server, which I sincerely
hope I won't have to do (soon ;( )):
6","name":"Fackelmann rezač krastavaca/kupusa drveni","description":null
in
C:\Users\Dito\Desktop\kplus\dump\2015_04_29-7_0_1_Sve za dom.dump
OLD: "nulL" NEW: "null"
C:\Users\Dito\Desktop\kplus\dump\2015_04_28-7_0_1_Sve za dom.dump
/categories/60005072/skolski-i-uredski-asortiman"},{"id"60004814,
C:\Users\Dito\Desktop\kplus\dump\2015_04_24-7_0_1_Igra$ke.dump
"barcode"
C:\Users\Dito\Desktop\kplus\dump\2015_04_23-7_0_1_Pi$a.dump
/categories/6000566/bezalkoh
C:\Users\Dito\Desktop\kplus\dump\2015_04_23-7_0_1_Pi$a.dump
,"image_m":/images/products/031/03180007m.gi
C:\Users\Dito\Desktop\kplus\dump\2015_04_23-7_0_1_Hrana.dump
C:\Users\Dito\Desktop\kplus\dump\2015_04_22-7_0_1_Knjige.dump
:null,"volume":null,"barcode":Null}],"ba