************************
******* ******
*** kplus ***
******* ******
************************
27.2.2015. 12:00
CAPTAIN'S LOG: FIRST COMMIT:
The parsing is to be done from here:
http://online.konzum.hr/#!/categories/60004323/hrana?show=all&sort_field=name&sort=nameAsc&max_price=22290&page=1&per_page=5430
A very important feature is that the "per_page" can be freely changed.
The XHR requests carry all the information needed.
They seem to have some sort of protection against foreign requests.
The only protection so far seems to be the "time" key provided with the
link, which I have yet to figure out.
For instance, http://online.konzum.hr/v2/categories?time=1425038753938
tells us which categories and subcategories exist.
Potential defense mechanisms:
_ws-rails_session_id
Phusion Passenger 4.0.41
this "time" which is somehow extracted
X-Auth-Token probably from "Phusion Passenger"
"WEBSHOP_COOKIE_online.konzum.hr"
"konzum_hr_wsm_auth_token"
-------------------------------------
27.2.2015. 22:26
UPDATE:
Well, as it turns out, it was a bit easier than expected.
First of all, some useful reads:
http://en.wikipedia.org/wiki/XMLHttpRequest
http://www.w3.org/TR/cors/
http://axilis.com/ <- creators of the website
Fun fact: the creators of the latest Konzum website took part
in a project that parsed their older website.
http://www.html5rocks.com/en/tutorials/file/xhr2/
http://en.wikipedia.org/wiki/Cross-site_request_forgery
https://www.linkedin.com/profile/view?id=130331247
All in all, there were no authentication issues.
No authentication was necessary.
My guess is that the XHR request can't simply be made from the browser
but has to be run as a script.
All my "cookies" and "user-agent" fields were rejected when
making a request.
The main thing was to find a node.js XHR library, and the "xhr2" lib
does just that.
After that and some googling I found a code example and voilà,
I have 1200 Konzum items in my file with all the info I'd need.
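For the record, a minimal sketch of such a request; that the "time"
parameter can simply be the current millisecond timestamp (Date.now())
is an assumption on my part:

var XMLHttpRequest = require('xhr2');  // browser-style XHR for node.js

var request = new XMLHttpRequest();
// Assumption: "time" is just a millisecond timestamp, so Date.now() should do
var url = 'http://online.konzum.hr/v2/categories?time=' + Date.now();

request.open('GET', url, true);
request.setRequestHeader('Accept', 'application/json, text/plain');
request.onreadystatechange = function () {
  if (request.readyState === 4) {
    // Response is JSON listing the categories and subcategories
    console.log(JSON.parse(request.response));
  }
};
request.send();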
---------------------------------------
26.3.2015. 9:48
UPDATE:
Can't believe it's already been a month.
Well, after trying to get things working in Haskell,
I must admit that I've given up.
The issue is that the JSON I have is huge and contains a lot of reserved
keywords, so I'd have to put a lot more work into getting the parser
running, and even then there's no guarantee that Haskell is the right
choice for this.
My next step is to get this working with MongoDB.
The concept:
1. A script runs periodically (every 24h?) and scrapes Konzum
for updated prices.
2. A second script imports these files into MongoDB.
After that I would have to figure out what I actually
want to do with the data.
I suppose some cool graphs in d3js would be nice, I've wanted to do
something with this for quite a while.
I've made some bad choices regarding my server, but I suppose this is
how hacks happen: when you are lazy.
I suppose my laziness comes from inexperience. If I were to do things
securely it would take me far too long,
and I already have a ton of things to figure out. I have to pick up some
security habits ASAP.
The scraping script so far:
var XMLHttpRequest = require('xhr2');  // browser-style XHR for node.js
var fs = require('fs');
// Timestamp for the dump filename (getDate() = day of month; getMonth() is 0-based)
var cd = new Date();
var datetime = cd.getDate() + "_" + cd.getMonth() + "_" + cd.getFullYear()
    + "-" + cd.getHours() + "_" + cd.getMinutes() + "_" + cd.getSeconds();
var request = new XMLHttpRequest();
var path = "http://online.konzum.hr/v2/categories/60006861/products?filter%5Bshow%5D=all&filter%5Bsubcategory_id%5D=&filter%5Bsort_field%5D=name&filter%5Bsort_type%5D=asc&filter%5Bprice%5D%5Bmin%" +
    "5D=0&filter%5Bprice%5D%5Bmax%5D=110&filter%5Bsort%5D=nameAsc&per_page=1&page=1&time=1427296297983";
request.open("GET", path, true);
request.setRequestHeader('Accept', 'application/json, text/plain');
// Write the dump once the response has fully arrived
request.onreadystatechange = function () {
    if (request.readyState === 4) {
        fs.writeFileSync('konzum_' + datetime + '.dump', request.response);
    }
};
request.send();
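That covers step 1 of the concept above. Step 2, the MongoDB import,
isn't written yet; a rough sketch of what it could look like, where the
collection name, dump filename and dump layout are all assumptions:

var fs = require('fs');
var MongoClient = require('mongodb').MongoClient;

// Hypothetical dump layout: an array of products under a "products" key
var products = JSON.parse(fs.readFileSync('konzum_26_3_2015.dump', 'utf8')).products;

MongoClient.connect('mongodb://localhost:27017/kplus', function (err, db) {
  if (err) throw err;
  // The "products" collection name is made up for this sketch
  db.collection('products').insert(products, function (err) {
    if (err) throw err;
    db.close();
  });
});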
26.3.2015. 10:50
UPDATE:
So, the script is running.
I've set it to be saved in /home/.
crontab is used to periodically run scripts; crontab -e opens the crontab file.
0 8,20 * * * nodejs /home/kparse.js
runs the script every day at 8am and 8pm.
I wish I had done this sooner to have more interesting data.
I suppose I should back this up as well.
Maybe use crontab to mail it? hehe
OK, the script is not finished yet.
I will make it fetch all the items from all the categories
once the category list has been parsed.
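A rough sketch of that loop; the categories dump filename and the "id"
field are assumptions until I check against the real dump:

var XMLHttpRequest = require('xhr2');
var fs = require('fs');

// Assumption: the categories dump is a JSON array of objects with an "id" field
var categories = JSON.parse(fs.readFileSync('/home/konzum_categories.dump', 'utf8'));

categories.forEach(function (category) {
  var request = new XMLHttpRequest();
  var url = 'http://online.konzum.hr/v2/categories/' + category.id +
            '/products?per_page=5430&page=1&time=' + Date.now();
  request.open('GET', url, true);
  request.setRequestHeader('Accept', 'application/json, text/plain');
  request.onreadystatechange = function () {
    if (request.readyState === 4) {
      // One dump file per category
      fs.writeFileSync('/home/konzum_' + category.id + '.dump', request.response);
    }
  };
  request.send();
});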
29.3.2015. 20:40
UPDATE:
Man, time flies by.
Ok, the script is up and running.
It runs every day at 7am and each run takes up about 20 MB.
I hope I'll get some useful information by the end of this.
3.5.2015. 00:42
UPDATE:
The script is still running regularly without any issues whatsoever.
I have also set up a MySQL database on the server, which now holds
all the entries collected so far (300k+ rows).
I had some issues with uploading all these entries.
I've used node.js and its mysql module, which ROCKS!!
At first things were slow over a single connection (5h for 100k rows),
but after editing only 2 lines and adding 99 more connections
the whole database upload takes around 15 minutes.
Also, there was an issue with the character set, so after changing it to
utf8 to support Croatian characters (č, ć, š, ž) I had to reupload everything.
I've created a composite primary key (id, datum), which seems useful.
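A minimal sketch of the pooled upload, so I don't forget how it worked.
The table name, the columns other than (id, datum), the dump layout and
the credentials are all made up for illustration; only the ~100-connection
pool, the utf8 charset and the (id, datum) key reflect the actual setup:

var mysql = require('mysql');
var fs = require('fs');

// ~100 connections instead of a single one is what cut the upload time
var pool = mysql.createPool({
  connectionLimit: 100,
  host: 'localhost',
  user: 'kplus',               // hypothetical credentials
  password: 'secret',
  database: 'kplus',
  charset: 'utf8_general_ci'   // so Croatian characters survive the upload
});

// Hypothetical dump layout: an array of products under "products"
var dump = JSON.parse(fs.readFileSync('konzum_26_3_2015.dump', 'utf8'));
var datum = '2015-03-26';      // date of the dump, part of the primary key

dump.products.forEach(function (p) {
  pool.query(
    'INSERT INTO prices (id, datum, name, price) VALUES (?, ?, ?, ?)',
    [p.id, datum, p.name, p.price],
    function (err) { if (err) console.error(err); }
  );
});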
Before the database I had some issues with making charts.
I've tried the bokeh Python lib and the chartist/chart.js libs for JS,
but the biggest problem was crunching the huge files.
It should be much better now that the database is up and running smoothly.
Also, the mongoose app has been a huge help.
Basically, it sets up a localhost server in the folder it is run from,
and then you can make requests to it, which really helped me get around
the "local file reading disabled" restriction in browser JS.
Now that I think of it, I probably only needed it for browser scripts,
since the locally run node.js script could have used node's fs module,
but oh well, good to know it exists.
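For reference, what pulling a dump into a browser chart script through
that local server looks like; that mongoose serves the dump folder on
port 8080 and the exact filename are assumptions:

// Browser-side: fetch a dump over HTTP instead of reading it from disk,
// which is what the "local file reading disabled" restriction forbids
var request = new XMLHttpRequest();
request.open('GET', 'http://localhost:8080/konzum_26_3_2015.dump', true);
request.onreadystatechange = function () {
  if (request.readyState === 4) {
    var data = JSON.parse(request.response);
    console.log(data);  // feed this to chartist / chart.js from here
  }
};
request.send();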
19.5.2015. 15:11
UPDATE:
Time to start wrapping things up.
A useful regex (matches a whole line containing "ezonski", i.e. the "Sezonski ..." entries):
\n.+ezonski.+\n
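A sketch of applying it to a dump before parsing; that the matching lines
are meant to be dropped (rather than just located in an editor) is my
assumption, and the filenames are placeholders:

var fs = require('fs');

// Remove every line containing "ezonski", keeping a single newline in its place
var raw = fs.readFileSync('konzum_26_3_2015.dump', 'utf8');
var cleaned = raw.replace(/\n.+ezonski.+\n/g, '\n');
fs.writeFileSync('konzum_26_3_2015.cleaned.dump', cleaned);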
I have had some issues while reuploading.
Not sure how, but mistakes crept in while copying or downloading the files
and some crucial JSON elements got broken, so the parser does not work.
Anyhow, the following things in these files have been changed
(in case I have to fix them again on the server, which I sincerely
hope I won't have to do (soon ;( )):
6","name":"Fackelmann rezač krastavaca/kupusa drveni","description":null
in
C:\Users\Dito\Desktop\kplus\dump\2015_04_29-7_0_1_Sve za dom.dump
OLD: "nulL" NEW: "null"
C:\Users\Dito\Desktop\kplus\dump\2015_04_28-7_0_1_Sve za dom.dump
/categories/60005072/skolski-i-uredski-asortiman"},{"id"60004814,
C:\Users\Dito\Desktop\kplus\dump\2015_04_24-7_0_1_Igra$ke.dump
"barcode"
C:\Users\Dito\Desktop\kplus\dump\2015_04_23-7_0_1_Pi$a.dump
/categories/6000566/bezalkoh
C:\Users\Dito\Desktop\kplus\dump\2015_04_23-7_0_1_Pi$a.dump
,"image_m":/images/products/031/03180007m.gi
C:\Users\Dito\Desktop\kplus\dump\2015_04_23-7_0_1_Hrana.dump
C:\Users\Dito\Desktop\kplus\dump\2015_04_22-7_0_1_Knjige.dump
:null,"volume":null,"barcode":Null}],"ba