#!/usr/bin/python
"""Module for crawling the web, extracting numbers, counting links and other stats
Calculates statistics of the data gathered and plots 2-D plots.
Standard Module Dependencies:
argparse ArgumentParser
urllib urlencode, urlopen,...
urllib2 HTTPRedirectHandler, HTTPCookieProcessor, etc
time sleep
datetime now(), datetime.strptime(), datetime.datetime(), etc
httplib IncompleteRead
numpy
matplotlib pyplot.plot
Nonstandard Module Dependencies:
tz Local # local time zone object definition
TODO:
1. deal with csv: http://www.google.com/trends/?q=bitcoin&ctab=0&geo=us&date=ytd&sort=0 ,
<a href='/trends/viz?q=bitcoin&date=ytd&geo=us&graph=all_csv&sort=0&scale=1&sa=N'>
other examples in comments below
2. poll domain name registries to determine the number of domain names with "bitcoin" in them or beginning with "bit" or having "bit" and "coin" in them
3. build website and REST to share bitcoin trend info, several domain names saved at bustaname under shopper@tg username
pairbit, bitpair, coinpair, paircoin, coorbit, bitcorr, bitcoinarbitrage, etc
4. generalize the search with AI and ML to identify good and bad trends for all common proper names--stock symbols, etc
a) write research paper to prove it does as good a job as a human stock analyst at predicting future price movements
b) write a browser plugin that allows a human to supervise the machine learning and identify useful/relevant quantitative data
5. implement the indexer and search engine for the double-star question 3 in CS101 and get quant data directly from the index
    6. implement the Levenshtein distance algorithm from the CS101 exam for use in word-stemming and search term similarity estimates
7. record response time of web pages as one of the stats associated with each url
    8. use historical load-time data to prioritize quickly-loading pages over defunct, slow pages (like bitcoinconsultancy.com)
:author: Hobson Lane dba TotalGood
:copyright: 2012 by Hobson Lane ([email protected]), see AUTHORS for details
:license: Creative Commons BY-NC-SA, see LICENSE for more details
"""
# TODO: smart import by wrapping all import statements in try: blocks
# TODO: smarter import with graceful fallback to "pass" or simple local implementations for unavailable modules
# TODO: smartest import with pip install (setup.py install) of missing modules, if possible
# TODO: ai import with automatic, on-the-fly, python source code generation... with comments and docstrings! ;)
import datetime
import time
from tz import Local
import os
import urllib
import urllib2
import httplib
import json
from pprint import pprint
from argparse import ArgumentParser
import re
from warnings import warn
import matplotlib.pyplot as plt
from utils import size, size2, size3
import collections # .Iterable
FILEPATH=os.path.expanduser('data/bitcrawl_historical_data.json') # change this to a path you'd like to use to store data
MIN_ORDINAL=1800*365.25 # data associated with datetime ordinals smaller than this will be ignored
MAX_ORDINAL=2100*365.25 # data associated with datetime ordinals larger than this will be ignored
SAMPLE_BIAS_COMP = 0 # whether to divide variance values by N-1 (0 divides by N so that small sample sets still give 1 for Pearson self-correlation coefficient)
# Hard-coded regular expressions, keywords, and URLs for gleaning numerical data from the web
URLs={'network':
{
'url': 'http://bitcoincharts.com/about/markets-api/',
'blocks':
[r'<td class="label">Blocks</td><td>', # (?<= ... )\s*
r'[0-9]{1,9}' ], # (...)
'total_btc': # total money supply of BTC
[r'<td class="label">Total BTC</td><td>',
r'[0-9]{0,2}[.][0-9]{1,4}[TGMKkBb]' ],
'difficulty':
[r'<td class="label">Difficulty</td><td>',
r'[0-9]{1,10}' ],
'estimated': # total money supply of BTC
[r'<td class="label">Estimated</td><td>',
r'[0-9]{1,10}' ] ,
            'estimated_blocks': # block count at which the estimated total BTC supply is reached (renamed from a duplicate 'blocks' key that silently overwrote the entry above)
[r'<td class="label">Estimated</td><td>\s*[0-9]{1,10}\s*in',
r'[0-9]{1,10}' ] ,
'hash_rate': # THash/s on the entire BTC network
[r'<td class="label">Network total</td><td>',
r'[0-9]{0,2}[.][0-9]{1,4}' ],
'block_rate': # blocks/hr on the entire BTC network
[r'<td class="label">Blocks/hour</td><td>',
r'[0-9]{0,3}[.][0-9]{1,4}' ] } ,
'trade': {
'url': 'https://en.bitcoin.it/wiki/Trade',
'visits':
[r'has\sbeen\saccessed\s',
r'([0-9]{1,3}[,]?){1,4}' ] },
'shop': {
'url': 'https://en.bitcoin.it/wiki/Real_world_shops',
'visits':
[r'has\sbeen\saccessed\s',
r'([0-9]{1,3}[,]?){1,4}' ] },
'bitcoin': {
'url': 'https://en.bitcoin.it/wiki/Main_Page',
'visits':
[r'has\sbeen\saccessed\s',
r'([0-9]{1,3}[,]?){1,4}' ] },
# went "offline" sometime around May 20th
# 'consultancy': {
# 'url': 'https://bitcoinconsultancy.com/wiki/Main_Page',
# 'visits':
# [r'has\sbeen\saccessed\s',
# r'([0-9]{1,3}[,]?){1,4}' ] },
'mtgox': {
'url': 'https://mtgox.com',
'average':
[r'Weighted\s*Avg\s*:\s*<span>',
r'\$[0-9]{1,2}[.][0-9]{3,6}' ],
'last':
[r'Last\s*price\s*:\s*<span>',
r'\$[0-9]{1,2}[.][0-9]{3,6}' ],
'high':
[r'High\s*:\s*<span>',
r'\$[0-9]{1,2}[.][0-9]{3,6}' ],
'low':
[r'Low\s*:\s*<span>',
r'\$[0-9]{1,2}[.][0-9]{3,6}' ],
'volume':
[r'Volume\s*:\s*<span>',
r'[0-9,]{1,9}' ] },
'virwox': {
'url': 'https://www.virwox.com/',
'volume': # 24 hr volume
# (?s) means to match '\n' with dot ('.*' or '.*?')
[r"(?s)<fieldset>\s*<legend>\s*Trading\s*Volume\s*\(SLL\)\s*</legend>\s*<table.*?>\s*<tr.*?>\s*<td>\s*<b>\s*24\s*[Hh]ours\s*[:]?\s*</b>\s*</td>\s*<td>",
r'[0-9,]{1,12}'],
'SLLperUSD_ask':
[r"<tr.*?>USD/SLL</th><td.*?'buy'.*?>",
r'[0-9]{1,6}[.]?[0-9]{0,3}'],
'SLLperUSD_bid':
[r"<tr.*?>USD/SLL</th.*?>\s*<td.*?'buy'.*?>.*?</td>\s*<td.*?'sell'.*?>",
r'[0-9]{1,6}[.]?[0-9]{0,3}'],
'BTCperSLL_ask':
[r"<tr.*?><th.*?>BTC/SLL\s*</th>\s*<td\s*class\s*=\s*'buy'\s*width=\s*'33%'\s*>\s*", # TODO: generalize column/row/element extractors
r'[0-9]{1,6}[.]?[0-9]{0,3}'],
'BTCperSLL_bid':
[r"<tr.*?>BTC/SLL</th.*?>\s*<td.*?'buy'.*?>.*?</td>\s*<td.*?'sell'.*?>",
r'[0-9]{1,6}[.]?[0-9]{0,3}'] },
'cointron': {
'url': 'http://coinotron.com/coinotron/AccountServlet?action=home', # miner doesn't follow redirects like a browser so must use full URL
'hash_rate':
[r'<tr.*?>\s*<td.*?>\s*BTC\s*</td>\s*<td.*?>\s*',
r'[0-9]{1,3}[.][0-9]{1,4}\s*[TMG]H',
r'</td>'], # unused suffix
'miners':
            [r'(?s)<tr.*?>\s*<td.*?>\s*BTC\s*</td>\s*<td.*?>\s*[0-9]{1,3}[.][0-9]{1,4}\s*[TGM]?H\s*</td>\s*<td.*?>', # trailing comma added: without it this prefix silently merged with the value regex below
r'[0-9]{1,4}\s*[BbMmKk]?',
r'</td>'], # unused suffix
            'hash_rate_LTC': # Litecoin
[r'<tr.*?>\s*<td.*?>\s*LTC\s*</td>\s*<td.*?>\s*',
r'[0-9]{1,3}[.][0-9]{1,4}\s*[TMG]H',
r'</td>'], # unused suffix
'miners_LTC':
[r'(?s)<tr.*?>\s*<td.*?>\s*LTC\s*</td>\s*<td.*?>\s*[0-9]{1,3}[.][0-9]{1,4}\s*[TGM]?H\s*</td>\s*<td.*?>',
r'[0-9]{1,4}\s*[BbMmKk]?',
r'</td>'], # unused suffix
'hash_rate_SC': # scamcoin
[r'<tr.*?>\s*<td.*?>\s*SC\s*</td>\s*<td.*?>\s*',
r'[0-9]{1,3}[.][0-9]{1,4}\s*[TMG]H',
r'</td>'], # unused suffix
'miners_SC':
[r'(?s)<tr.*?>\s*<td.*?>\s*SC\s*</td>\s*<td.*?>\s*[0-9]{1,3}[.][0-9]{1,4}\s*[TGM]?H\s*</td>\s*<td.*?>',
r'[0-9]{1,4}\s*[BbMmKk]?',
r'</td>'] }, # unused suffix
}
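# Illustrative sketch only (the module's real extractor, mine_data(), is defined
# further down in this file): each entry in URLs pairs a "prefix" regex that
# anchors the search with a "value" regex that captures the number itself, plus
# an optional, currently unused suffix regex.  A consumer searches for the
# prefix and then captures whatever the value regex matches immediately after it.
# The helper name below is hypothetical, not part of the original module.
def _example_extract(html, prefix, value_regex):
    """Return the first value matched right after `prefix` in `html`, else None.
    >>> _example_extract('<td class="label">Blocks</td><td>183947</td>',
    ...                  r'<td class="label">Blocks</td><td>', r'[0-9]{1,9}')
    '183947'
    """
    mo = re.search('(?:' + prefix + ')(' + value_regex + ')', html)
    return mo.group(1) if mo else None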
def get_seeds(path='data/bitsites.txt'):
"""Read in seed urls from a flatfile (newline delimitted)
>>> print len(get_seeds())
68
"""
try:
f = open(path,'r')
except:
print 'Unable to find the file "'+path+'".'
return []
s = f.read()
return s.split('\n') # FIXME: what about '\r\n' in Windows
# Additional seed data URLs
# TODO: function to extract/process CSV
#Historic Trade Data available from bitcoincharts and not yet mined:
#Trade data is available as CSV, delayed by approx. 15 minutes.
#http://bitcoincharts.com/t/trades.csv?symbol=SYMBOL[&start=UNIXTIME][&end=UNIXTIME]
#returns CSV:
#unixtime,price,amount
#Without start or end set it'll return the last few days (this might change!).
#Examples
#Latest mtgoxUSD trades:
#http://bitcoincharts.com/t/trades.csv?symbol=mtgoxUSD
#All bcmPPUSD trades:
#http://bitcoincharts.com/t/trades.csv?symbol=bcmPPUSD&start=0
#btcexYAD trades from a range:
#http://bitcoincharts.com/t/trades.csv?symbol=btcexYAD&start=1303000000&end=1303100000
#Telnet interface
#There is an experimental telnet streaming interface on TCP port 27007.
#This service is strictly for personal use. Do not assume this data to be 100% accurate or write trading bots that rely on it.
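# Hedged sketch for TODO 1 in the module docstring and the CSV comments above:
# fetch and parse one of the bitcoincharts trades CSV feeds.  This helper is not
# part of the original crawler; its name and the 'mtgoxUSD' default symbol are
# illustrative only.
import csv
import StringIO
def fetch_trades_csv(symbol='mtgoxUSD', start=None, end=None):
    """Return a list of (unixtime, price, amount) float triples."""
    url = 'http://bitcoincharts.com/t/trades.csv?symbol=' + symbol
    if start is not None:
        url += '&start=' + str(int(start))
    if end is not None:
        url += '&end=' + str(int(end))
    try:
        raw = urllib.urlopen(url).read()
    except IOError:
        return []
    trades = []
    for row in csv.reader(StringIO.StringIO(raw)):
        if len(row) == 3:
            try:
                trades.append((float(row[0]), float(row[1]), float(row[2])))
            except ValueError:
                pass  # skip malformed rows
    return trades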
class Bot:
"""A browser session that follows redirects and maintains cookies.
TODO:
allow specification of USER_AGENT, COOKIE_FILE, REFERRER_PAGE
if possible should use the get_page() code from the CS101 examples to show "relevance" for the contest
Examples:
>>> len(Bot().GET('http://totalgood.com',retries=1,delay=0,len=100))
100
"""
def __init__(self):
self.retries = 0
self.response = ''
self.params = ''
self.url = ''
# TODO: implement getter/setters for username and password to get past paywalls
# # Create an OpenerDirector with support for Basic HTTP Authentication...
# auth_handler = urllib2.HTTPBasicAuthHandler()
# auth_handler.add_password(realm='PDQ Application',
# uri='https://mahler:8092/site-updates.py',
# user='klem',
# passwd='kadidd!ehopper')
# opener = urllib2.build_opener(auth_handler)
# # ...and install it globally so it can be used with urlopen.
# urllib2.install_opener(opener)
# urllib2.urlopen('http://www.example.com/login.html')
self.redirecter = urllib2.HTTPRedirectHandler()
self.cookies = urllib2.HTTPCookieProcessor()
self.opener = urllib2.build_opener(self.redirecter, self.cookies)
# replace the default urllib2 user-agent
self.opener.addheaders = [('User-agent', 'Mozilla/5.0')]
def GET(self, url, retries=2, delay=2, len=1e7):
        # FIXME: doesn't work on non-HTTPS URLs!!
self.retries = max(self.retries, retries)
# don't wait less than 0.1 s or longer than 1 hr when retrying a network connection
delay = min(max(delay,0.1),3600)
file_object, datastr = None, ''
try:
#print 'opening ', url
file_object = self.opener.open(url)
# build_opener object doesn't handle 404 errors, etc !!!
# TODO: put all these error handlers into our Bot class
        # NOTE: IncompleteRead and BadStatusLine have no .code attribute, and
        # URLError.reason is not always an indexable tuple, so report those with repr()/str()
        except httplib.IncompleteRead, e:
            print "HTTP read for URL '"+url+"' was incomplete: "+repr(e)
        except urllib2.HTTPError, e:
            print "HTTP error for URL '"+url+"': %d" % e.code
        except httplib.BadStatusLine, e:
            print "HTTP bad status line for URL '"+url+"': "+repr(e)
        except urllib2.URLError, e:
            print "Network error for URL '"+url+"': "+str(e.reason)
if not file_object:
# retry
if retries:
print "Waiting "+str(delay)+" seconds before retrying network connection for URL '"+url+"'..."
print "Retries left = "+str(retries)
time.sleep(delay)
print "Retrying network connection for URL '"+url+"'."
return self.GET(url,retries-1)
print "Exceeded maximum number of Network error retries."
else:
try:
datastr = file_object.read(len) # should populate datastr with an empty string if the file_object is invalid, right?
except:
print('Error reading http GET response from url '+repr(url)+
' after at most '+str(self.retries)+' retries.')
return datastr
def POST(self, url, params):
self.url = url
self.params = urllib.urlencode(params)
self.response = self.opener.open(url, self.params ).read()
return self.response
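# Example usage of Bot (comments only, so nothing runs at import time; the POST
# form-field names below are hypothetical, purely illustrative):
#   bot = Bot()
#   html = bot.GET('https://en.bitcoin.it/wiki/Trade', retries=1, delay=1)
#   reply = bot.POST('https://example.com/login', {'user': 'alice', 'passwd': 'secret'})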
def get_page(url):
"""Retrieve a webpage from the given url (don't follow redirects or use cookies, though)
>>> print 1000 < len(get_page('http://google.com')) < 1E7
True
"""
try:
return urllib.urlopen(url).read()
except:
return ''
# These extensive, complicated datetime regex patterns don't work!
QUANT_PATTERNS = dict(
# HL: added some less common field/column separators: colon, vertical_bar
SEP = r'\s*[\s,;\|:]\s*',
DATE_SEP = r'\s*[\s,;\|\-\:\_\/]\s*',
# based on DATE_SEP (with \s !!) ORed with case insensitive connecting words like "to" and "'till"
RANGE_SEP = r"(?i)\s*(?:before|after|then|(?:(?:un)?(?:\')?til)|(?:(?:to)?[\s,;\|\-\:\_\/]{1,2}))\s*",
TIME_SEP = r'\s*[\s,;\|\-\:\_]\s*',
# HL: added sign, spacing, & exponential notation: 1.2E3 or +1.2 e -3
FLOAT = r'[+-]?\d+(?:\.\d+)?(?:\s?[eE]\s?[+-]?\d+)?',
FLOAT_NONEG = r'[+]?\d+(?:\.\d+)?(?:\s?[eE]\s?[+-]?\d+)?',
FLOAT_NOSIGN = r'\d+(?:\.\d+)?(?:\s?[eE]\s?[+-]?\d+)?',
# HL: got rid of exponential notation with an E and added x10^-4 or *10^23
FLOAT_NOE = r'[+-]?\d+(?:\.\d+)?(?:\s?[xX*]10\s?\^\s?[+-]?\d+)?',
FLOAT_NONEG_NOE = r'[+]?\d+(?:\.\d+)?(?:\s?[xX*]10\s?\^\s?[+-]?\d+)?',
FLOAT_NOSIGN_NOE = r'\d+(?:\.\d+)?(?:\s?[xX*]10\s?\^\s?[+-]?\d+)?',
# HL: added sign and exponential notation: +1e6 -100 e +3
INT = r'[+-]?\d+(?:\s?[eE]\s?[+]?\d+)?',
INT_NONEG = r'[+]?\d+(?:\s?[eE]\s?[+]?\d+)?',
INT_NOSIGN = r'\d+(?:\s?[eE]\s?[+]?\d+)?', # HL: exponents should always be allowed a sign
INT_NOSIGN_2DIGIT = r'\d\d',
INT_NOSIGN_4DIGIT = r'\d\d\d\d',
INT_NOSIGN_2OR4DIGIT = r'(?:\d\d){1,2}',
YEAR = r'(?i)(?:1[0-9]|2[012]|[1-9])?\d?\d(?:\s?AD|BC)?', # 2299 BC - 2299 AD, no sign
MONTH = r'[01]\d|\d', # 01-12
DAY = r'[0-2]\d|3[01]|[1-9]', # 01-31 or 1-9
HOUR = r'[0-1]\d|2[0-4]|\d', # 01-24 or 0-9
MINUTE = r'[0-5]\d', # 00-59
SECOND = r'[0-5]\d(?:\.\d+)?', # 00-59
)
DATE_PATTERN = re.compile(r"""
(?P<y>%(YEAR)s)%(DATE_SEP)s
(?P<mon>%(MONTH)s)%(DATE_SEP)s
(?P<d>%(DAY)s)
""" % QUANT_PATTERNS, re.X)
TIME_PATTERN = re.compile(r"""
(?P<h>%(HOUR)s)%(TIME_SEP)s
(?P<m>%(MINUTE)s)(?:%(TIME_SEP)s
(?P<s>%(SECOND)s))?
""" % QUANT_PATTERNS, re.X)
DATETIME_PATTERN = re.compile(r'(?P<date>'+DATE_PATTERN.pattern+
r')(?:'+QUANT_PATTERNS['DATE_SEP']+
r')?(?P<time>'+TIME_PATTERN.pattern+r')?', re.X)
def zero_if_none(x):
if not x:
return 0
return x
def parse_date(s):
"""Nested regular expressions to proces date-time strings
>>> parse_date('2001-2-3 4:56:54.123456789')
datetime.datetime(2001, 2, 3, 4, 56, 54, 123456)
Values for seconds or minutes that exceed 60 are ignored
>>> parse_date('2001-2-3 4:56:78.910')
datetime.datetime(2001, 2, 3, 4, 56)
>>> parse_date('2012-04-20 23:59:00')
datetime.datetime(2012, 4, 20, 23, 59)
    >>> parse_date('1776-07-04')
    datetime.datetime(1776, 7, 4, 0, 0)
    >>> parse_date('2012-04-20 13:34')
    datetime.datetime(2012, 4, 20, 13, 34)
"""
from math import floor
mo=DATETIME_PATTERN.search(s)
if mo:
y = mo.group('y') or 0
if len(y) == 2:
if y[0] == '0':
y = int(y) + 2000
# else:
# y = int(y)
# if y > 20 and y < 100:
# y = y + 1900
y = int(y)
mon = int(zero_if_none(mo.group('mon')))
d = int(zero_if_none(mo.group('d')))
h = int(zero_if_none(mo.group('h')))
m = int(zero_if_none(mo.group('m')))
s_f = float(zero_if_none(mo.group('s')))
s = int(floor(s_f))
us = int((s_f-s)*1000000.0)
return datetime.datetime(y,mon,d,h,m,s,us)
else:
raise ValueError("Date time string not recognizeable or not within a valid date range (2199 BC to 2199 AD): %s" % s)
def parse_time(s):
"""Nested regular expressions to time strings
>>> parse_time('4:56:54.123456789')
datetime.time(4, 56, 54, 123456)
"""
from math import floor
mo=TIME_PATTERN.search(s)
if mo:
h = int(zero_if_none(mo.group('h')))
m = int(zero_if_none(mo.group('m')))
s_f = float(zero_if_none(mo.group('s')))
s = int(floor(s_f))
us = int((s_f-s)*1000000.0)
# FIXME: parse the AM/PM bit
return datetime.time(h,m,s,us)
else:
raise ValueError("Time string not recognizeable or not within a valid date range (00:00:00 to 24:00:00): %s" % s)
def get_next_target(page):
"""Extract a URL from a string (HTML for a webpage)
>>> print get_next_target('hello <a href="world">.</a>')
('world', 20)
"""
start_link = page.find('<a href=')
if start_link == -1:
return None, 0
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
url = page[start_quote + 1:end_quote]
return url, end_quote
def union(p,q):
for e in q:
if e not in p:
p.append(e)
def interp_multicol(lol,newx=None):
"""Linearly interpolate mulitple columns of data. First column is independent variable.
>>> interp_multicol([range(6),[float(x)**1.5 for x in range(6)],range(3,-3,-1)],[0.4*x for x in range(15)])
"""
#print lol
#lol = make_wide(lol)
lol = transpose_lists(lol)
x=lol[0]
#print lol[1:]
for c,col in enumerate(lol[1:]):
#print c,col,x,newx
        lol[c+1] = interpolate(x,col,newx)
    return lol  # return the interpolated (transposed) columns; without this the function had no effect
def interpolate(x,y,newx=None,method='linear',verbose=True):
"""
Interpolate y for newx.
y and newx must be the same length
>>> interpolate([0,1,2],[5,6,7],[-.5,0,.33,.66,1.,1.33,1.66,2,2.5])
[5.0, 5.0, 5.33, 5.66, 6.0, 6.33, 6.66, 7.0, 7.0]
>>> interpolate([0,3,4],[1,2,3])
[1.0, 1.6666666666666665, 3.0]
"""
# TODO: walk the dimensions of the lists, doing a size() to find which
# dimensions correspond (x <--> y) so that the interpolation vector
# lengths match
N = len(x)
if not len(y) == N:
raise ValueError('Interpolated lists must be the same length, even if the dependent variable (y) has more dimensions')
newy = []
if isinstance(x[0],(float,int,str)) and isinstance(y[0],(list,tuple)):
if verbose:
            print 'interpolate() is trying for size(x)='+repr(size(x))+' size(y)='+repr(size(y))+' size(newx)='+repr(size(newx))
x2 = []
for j in range(len(y[0])):
x2 += []
for i,x2 in enumerate(x):
x2[i][j] = x
return interpolate(x2,y,newx,method,verbose) # FIXME: doesn't work for 2-D y and 1-D x
for j in range(len(y[0])):
y1=[]
for i in range(len(y)):
if j<len(y[i]):
y1 += y[i][j]
newy += interpolate(x=x, y=y1, newx=newx, method=method, verbose=verbose)
return newy
elif isinstance(x[0],(list,tuple)) and isinstance(y[0],(list,tuple)):
# TODO: check the length of x[0] and y[0] to see which dimension in y corresponds to x
return [ interpolate(x1,y1,newx,method,verbose=verbose) for x1,y1 in zip(x,y) ]
# TODO: now that we're at the innermost dimension of the 2 lists, we need
# to check that the length of the x and y and/or newx lists match
# TODO: sort x,y (together as pairs of tuples) before interpolating, then unsort when done
if not newx:
N = max(len(x),2)
newx = [float(x1*(x[-1]-x[0]))/(N-1)+x[0] for x1 in range(len(x))]
#if newx and len(newx)>1:
#print make_wide(newx)
N=len(newx)
if not len(x)==len(y):
raise ValueError("Can't interpolate() for size(x)="+repr(size(x))+' size(y)='+repr(size(y))+'size(newx)='+repr(size(newx)))
if method.lower().startswith('lin'):
i, j, x0, y0 = 0, 0, newx[0], y[0]
while i < len(x) and j<N:
# no back-in-time extrapolation... yet
if x[i] <= newx[j]:
x0, y0 = float(x[i]), float(y[i])
i += 1
else:
if x[i] != x0: # check for divide by zero
newy.append((float(y[i])-y0)*(float(newx[j])-x0)/(float(x[i])-x0)+y0)
else: #nearest neighbor is fine if interpolation distance is zero!
newy.append(float(y0))
# if j>=N-1: # we've finished the last newx value
# break
j = j+1
# no extrapolation, assume time stops ;)
for j in range(j,N):
newy.append(float(y[-1]))
else:
raise(NotImplementedError('Interpolation method not implemented'))
return newy
def var2(listoflist):# assuming equal datetime intervals
"""
:Author: Nat
"""
averagelist=[]
variance =0
for element in listoflist:
#print 'element=',element
#print 'element[1]=',element[1]
averagelist.append(element[1])# appends average value from listoflist
sumlist = sum(averagelist)
meanavg = sumlist/len(averagelist)#mean of the list containing all the 'average' data
#print'meanavg=',meanavg
for e in averagelist:
variance = variance + (e - meanavg)**2
variance = variance/len(averagelist)
return variance
def wikipedia_view_rates(articles=['Bitcoin','James_Surowiecki'],verbose=False,names=''):
# TODO: make this a 2-D array with the article title and various view rate stats for each element in articles
dat = dict()
if not names:
name = 'wikipedia_view_rate'
elif isinstance(names,str):
name=names
elif isinstance(names,list) and len(names)==len(articles):
for i,article in enumerate(articles):
            dat[names[i]] = wikipedia_view_rate(article=article,verbose=verbose)
return dat
for article in articles:
#if verbose:
print 'Checking wikipedia view rate for "'+article+'"'
dat[name+'_'+article] = wikipedia_view_rate(article=article,verbose=verbose)
return dat
def wikipedia_view_rate(article='Bitcoin',verbose=False):
return mine_data(url='http://stats.grok.se/en/latest/'+article,
prefixes=r'[Hh]as\sbeen\sviewed\s',
regexes=r'[0-9,]{1,12}',
names='view_rate_'+article,
verbose=verbose)
def get_all_links(page):
links = []
while True:
url,endpos = get_next_target(page)
if url:
links.append(url)
page = page[endpos:]
else:
break
return links # could use set() to filter out duplicates
# TODO: compute and return other statistics about the page associated with the page:
# 1. page length
# 2. keywords & frequencies (use the CS101 multi-word indexer?)
# 3. meta data accuracy
# 4. number of links
# 5. depth
# 6. number of broken links
# 7. number of spelling errors and/or some grammar errors (the ones that are easy to detect reliably)
def get_links(url='https://en.bitcoin.it/wiki/Trade',max_depth=1,max_breadth=1e6,max_links=1e6,verbose=False,name=''):
""" Return a list of all the urls linked to from a page, exploring the graph to the specified depth.
    uses the get_page() and get_all_links() functions from the early part of CS101; should be updated using more recent CS101 code
TODO:
set default url if not url
BUG: tries to browse to weird URLs and bookmarks, e.g. "href=#Printing"
need to count stats like how many are local and how many unique second and top level domain names there are
"""
tocrawl = [url]
crawled = []
depthtocrawl = [0]*len(tocrawl)
depth = 0
page = tocrawl.pop()
depth = depthtocrawl.pop()
links = 0
if verbose:
print 'Counting links by crawling URL "'+url+'" to a depth of '+str(max_depth)+'...'
if not name:
name = 'data'
while depth<=max_depth and links<max_links:
links += 1
if page not in crawled:
i0=len(tocrawl)
link_urls = set(get_all_links(get_page(page))) # set() makes sure all links are unique
union(tocrawl, link_urls)
if verbose:
print 'Retrieved '+str(len(link_urls))+' links at "'+ page + '"'
crawled.append(page)
for i in range(i0,len(tocrawl)):
depthtocrawl.append(depth+1)
if not tocrawl: break
page = tocrawl.pop(0) # FIFO to insure breadth first search
depth = depthtocrawl.pop(0) # FIFO
dt = datetime.datetime.now(tz=Local)
return {name:{'datetime':str(dt),'url':url,'links':len(crawled),'depth':max_depth}}
# TODO: set default url if not url
def rest_json(url='https://api.bitfloor.com/book/L2/1',verbose=False):
if verbose:
print 'Getting REST data from URL "'+url+'" ...'
data_str = Bot().GET(url)
dt = datetime.datetime.now(tz=Local)
if verbose:
print 'Retrieved a '+str(len(data_str))+'-character JSON string at '+ str(dt)
if len(data_str)>2:
data = json.loads( data_str )
data['datetime']=str(dt)
data['url']=url
data['len']=len(data_str)
# this name needs to reflect the URL specified as an input rather than a hard-coded name
if verbose:
print data
return data
return None
# FIXME: ANTIIDIOM
#def readable(path):
# try:
# f = open(path,'r')
# # return f # but this makes readable() just like an f = open(... wrapped in a try
# f.close()
# return True
# except:
# return False
def file_is_readable(path):
    try:
        with open(path) as fp:
            return fp.readline()
    except IOError:
        return False  # a missing or unreadable file is reported as not readable instead of raising
# unfortunately the file will close when this returns so you can't just keep reading it
# better to do this within the code where this was called to continue using fp while it's open
# VERY ANTI-IDIOMATIC
def updateable(path,initial_content='',min_size=0):
if initial_content:
min_size = max(min_size,len(initial_content))
#TODO: use os.path_exists instead of try
if not min_size:
try:
f = open(path,'r+') # w = create the file if it doesn't already exist, truncate to zero length if it does
except:
return False
f.close()
return True
else:
if file_is_readable(path): # don't open for writing because that will create and truncate it
try:
f = open(path,'r+')
except:
return False
f.seek(0,2) # go to position 0 relative to 2=EOF (1=current, 0=begin)
if f.tell()>=min_size:
f.close()
return True
else:
f.close()
if initial_content:
f = open(path,'w')
f.write(initial_content)
f.close()
return True
else:
try:
f = open(path,'w')
except:
return False
if initial_content:
f = open(path,'w')
f.write(initial_content)
f.close()
return True
return False
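# A more idiomatic alternative to updateable(), sketched here per the
# FIXME/ANTI-IDIOM notes above.  It assumes the same intent: return True when
# `path` already holds at least min_size bytes and can be updated, otherwise
# try to (re)create it seeded with initial_content.  Not wired into the rest
# of the module.
def updateable_idiomatic(path, initial_content='', min_size=0):
    if initial_content:
        min_size = max(min_size, len(initial_content))
    if (os.path.isfile(path) and os.access(path, os.W_OK)
            and os.path.getsize(path) >= min_size):
        return True
    try:
        with open(path, 'w') as f:  # create the file, or truncate one that was too small
            f.write(initial_content)
        return True
    except IOError:
        return False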
def parse_index(s):
"""Parses an array/list/matrix index string.
>>> parse_index('[1][2]')
[1, 2]
>>> parse_index('[3][2][1]') # FIXME, commas and parentheses don't work !
[3, 2, 1]
"""
#return (0)
mo=re.match(r'[\[\(\{]+\s*(\d(?:\s*[:]\s*\d){0,2})+(?:\s*[\[\(\{\]\)\},; \t]\s*)+(\d(?:\s*[:]\s*\d){0,2})+\s*[\]\)\}]+',s)
# FIXME: needs to subparse the slice notation, e.g. '1:2:3'
# this only handles single indexes
return [int(s) for s in mo.groups()]
def parse_query(q):
"""Parse query string identifing records in bitcrawl_historical_data.json
Returns a 3 equal-length lists
sites = a name (key) for the webpage where data was originally found
values = the key for the values on the webpages identified by sites
datetimes = list of lists with the datetimes for which data is desired
    Friendlier interface for bycol_key(), which is in turn used for
forecast_data() & plot_data()
>>> parse_query('bitfloor.bids[0][0] date:2012-04-12 13:35')
(['bitfloor'], ['bids'],...)
"""
# print 'query', q
if q and isinstance(q,(list,set)): # a tuple is the return value for a single query string!
retval = [ parse_query(s) for s in q ]
retval = transpose_lists(retval)
return retval[0], retval[1], retval[2]
sites = []
values = []
datetimes = []
if q and isinstance(q,str):
# TODO: generalize query parsing with regex
tok = q.split(' ') # only space may separate query terms from tags
u,v = tok[0].split('.')
sites.append(u)
# FIXME: this will break if different types of braces are used in a row
n = max(v.find('['),v.find('('),v.find('{'))
if n>-1:
values.append(v[:n])
indexes=parse_index(v[n:])
else:
values.append(v)
datetimes.append([])
i = 1 # the first token must always be the site.value "URI"
while i < len(tok):
t = tok[i]
i += 1
# print 'tok',tok
if t.lower().startswith('date') and len(t)>5 and t[4] in ":= \t|,;":
if i<len(tok):
if isinstance(parse_time(tok[i]),datetime.time):
t += ' '+str(tok[i])
i += 1
            datetimes[-1].append(parse_date(t[5:]))  # the date value starts right after the 5-character 'date:' tag
# FIXME: this is no longer required for auto-correlation, unless you just want to double-check the algorithm
#sites = [sites,sites] if isinstance(sites,str) else sites
#values = [values,values] if isinstance(values,str) else values
# print '-'*10
# print sites
# print '-'*10
# print values
# print '-'*10
# print datetimes
# print '-'*10
return [sites[0], values[0], datetimes]
#return (s[0] for s in sitevalues, datetimes
# warn('Unable to identify the sites and values that query string attempted to retrieve. '+
# ' \n query = '+ str(q)+
# ' \n sites = '+ str(sites)+
# ' \n values = '+ str(values) )
# TODO: incoporate interpolation
def retrieve_data(sites='mtgox',values='average', datetimes=None, filepath=None, verbose=True):
"""
Retrieve data from bitcrawl_historical_data.json for sites and values specified
>>> retrieve_data('mtgox','last', ['2012-04-12 13:34','2012-04-15 13:35'])
[[734605.0, 734606.0, 734607.0, 734608.0],
[4.8833, 4.85660013796317, 4.890036645675644, 4.975431555147281]]
Surprisingly this doesn't retrieve the volumes, just the prices for bf bids
And the dimensions seem weird
>>> retrieve_data('bitfloor','bids', ['2012-04-12 13:34','2012-04-12 13:35'])
[[734605.0], [[[4.88], [4.87], [4.86], ... [4.6], [4.59], [4.58]]]]
"""
if isinstance(sites,list) and isinstance(values,list):
if verbose:
print 'sites and values are lists'
return [ retrieve_data(s,v) for s,v in zip(sites,values) ]
if isinstance(sites,list) and isinstance(values, str):
return [ retrieve_data(s,values) for s in sites ]
if isinstance(sites, str) and isinstance(values,list):
return [ retrieve_data(sites, v) for v in values ]
rows = []
if isinstance(values,str) and isinstance(sites,str):
if verbose:
print 'retrieving a single data series, ('+sites+','+values+').'
# very inefficient to reload data with every time series retrieved
data = load_json(filepath, verbose=False) # None filepath loads data from default path
if not data:
warn('Historical data could not be loaded from '+repr(filepath))
return []
rows = byrow_key(data, name=sites, yname=values, xname='datetime', verbose=False)
else:
warn('Invalid site key '+repr(sites)+', or value key '+repr(values))
return None
    if not (isinstance(rows,list) and (not rows or isinstance(rows[0],list))): # should always be an Nx2 matrix with each element either a value or a list of M values
        warn('Unable to find matching data of type '+repr(type(rows))+
' using site key '+repr(sites)+' and value key '+repr(values))
return None
NM = size(rows)
if verbose:
print "Retrieved an array of historical data records size",NM
if (not rows or not isinstance(NM,(list,tuple)) or len(NM)<2 or
any([nm<1 for nm in NM]) ):
print "Retrieved 1 or fewer data points, which is unusual."
return rows
t = []
if not datetimes:
        # interpolate column data to create regularly-spaced, e.g. daily or bi-daily, values
t = [float(x) for x in range(int(min(rows[0])),
int(max(rows[0]))+1)]
else:
t = datetime2float(datetimes) # this will be a float or list of floats
rows[1] = interpolate(x=rows[0], y=rows[1], newx=t, verbose=verbose)
rows[0] = t
# this can never happen
# i=1
# while i<len(sites):
# s,v = sites[i],values[i]
# rows2 = byrow_key(data,name=s,yname=v,xname='datetime')
# if len(rows2)<2:
# break
# # interpolate the new data to line up in time with the original data
# #print len(cols2[0]), len(cols2[1]),len(columns[0])
# newrow = interpolate(rows2[0], rows2[1], newx=t, verbose=verbose)
# #print newrow
# #print columns
# rows.append(newrow)
# #print columns
# i += 1
return rows
def query_data(q,filepath=None):
"""Retrieve data from bitcrawl_historical_data.json that matches a query string
    Friendlier interface for bycol_key(), which is in turn used for
forecast_data() & plot_data()
>>> query_data('bitfloor.bids[0][0] date:2012-04-12 13:35')
4.88
"""
sites, values, datetimes = parse_query(q)
return retrieve_data(sites=sites, values=values,
datetimes=datetimes, filepath=filepath)
def bycol_key(data, name='mtgox', yname='average', xname='datetime',verbose=False):#function for returning values given a key of the dictionary data
columns =[] # list of pairs of values
# loops thru each data item in the list
if not data:
warn('No data provided')
return []
for record in data:
# if this record (dict) contains the named key (e.g. 'mtgox')
#print 'looking for '+name
if name in record:
#print '-------- found '+name
keyrecord = record[name]
#print 'keyrecord=',keyrecord
#print 'type(keyrecord)=',type(keyrecord)
#print 'size(kr)=',size(keyrecord)
# is the requested x data name in the dictionary for the record?
# don't create a list entry for data points unless both x and y are available
if keyrecord and xname in keyrecord and yname in keyrecord:
# add the time to the empty row
dt = datetime2float(parse_date(keyrecord[xname]))
value = list2float(keyrecord[yname])
if dt and value and MIN_ORDINAL <= dt <= MAX_ORDINAL: # dates before 1800 don't make sense
columns.append([dt,value])
else:
warn('The record named '+repr(name)+' of type '+repr(type(keyrecord))+' did not contain x ('+repr(xname)+') or y ('+repr(yname)+') data. Historical data file may be corrupt.')
if verbose:
pprint(columns,indent=2)
return columns
def byrow_key(data, name='mtgox', yname='average', xname='datetime',verbose=False):
cols = bycol_key(data=data, name=name, yname=yname, xname=xname,verbose=verbose)
NM = size(cols)
# don't try to transpose anything that isn't a list of lists
    if (isinstance(NM,(list,tuple)) and len(NM)>1 and NM[0]>0 and NM[1]>0
and any([n>1 for n in NM]) ):
return(transpose_lists(cols))
else: return cols
def str2float(s=''):
"""Convert value string from a webpage into a float
Processes commas, units, and magnitude letters (G or B,K or k,M,m)
>>> str2float('$5.125 M USD')
5125000.0
"""
# save the original string for warning message printout and debugging
try:
return float(s)
except:
pass
try:
s0 = str(s)
except:
try:
s0 = unicode(s)
except:
raise ValueError('Unable to interpret string '+repr(s)+' as a float')
warn('Non-ascii object '+repr(s)+' passed to str2float')
s = s0.strip()
mag = 1.
scale = 1.
# TODO: add more (all Standard International prefixes)
mags = {'G':1e9,'M':1e6,'k':1e3,'K':1e3,'m':1e-3}
# factors just for reference not for conversion
# TODO: pull factors from most recent conversion rate data in historical file or config file?
# TODO: add national currencies and a few digital ones (e.g. Linden dollars)
units = {'$':1.,'USD':1.,'AUD':1.3,'BTC':5.,'EU':2.,'bit':1e-3}
# TODO: DRY-out
s=s.strip()
if s.lower().find('kb')>=0:
s=s.replace('KB','K').replace('kb','K').replace('Kb','K').replace('kB','K')
for k,m in mags.items():
if s.rfind(k) >= 0:
mag *= m
s=s.replace(k,'')
for k,u in units.items():
s=s.replace(k,'')
# scale/units-value unused, TODO: return in a tuple?
scale *= u
s=s.strip()
try:
return float( s.replace(',','').strip() )*mag
except:
warn('Unable to interpret string '+repr(s0)+'->'+repr(s)+' as a number')
return s # could return None
def list2float(s=''):
"""Convert a multi-dimensional list of strings to a multi-D list of floats
Processes commas, units, and magnitude letters (G or B,K or k,M,m)
>>> list2float([['$5.125 M USD','0.123 kB'],[1e-9]])
[[5125000.0, 123.0], [1e-09]]
"""
# convert some common iterables into lists
if isinstance(s,(list,set,tuple)):
return [list2float(x) for x in s]
try:
# maybe one day the float conversion will be 'vectorized'! ;)
return float(s)
except:
if not s:
            return float('nan')  # NaN was undefined here; float('nan') is the intended not-a-number value
if not isinstance(s,(str,unicode)):
warn("Unable to interpret non-string data "+repr(s)+" which is of type "+str(type(s)))
# convert bools and NoneTypes to 0. Empty lists have already returned.
if not s:
return 0.
# TODO: check tg.nlp.is_bool() before converting to numerical float
return str2float(s)
def datetime2float(dt=None):
"""Convert datetime object to a float, the ordinal number of days since epoch (0001-01-01 00:00:00)
>>> datetime2float(datetime.datetime(2012,4,20,23,59,59,999999))
734613.999988426
"""
if isinstance(dt,(list,set,tuple)):
if len(dt)==2: # 2-length date vectors are interpreted as the bounds of a daily series
# min max slice shenanigans is to get this snippet closer to something more general
dt = [ min( datetime2float(dt[:-1])),
max( datetime2float(dt[-1:])) ]
return [ float(x) for x in range(int(min(dt)), int(max(dt))+1) ]
else:
return [ datetime2float(x) for x in dt ]
if isinstance(dt, str):
return datetime2float(parse_date(dt))
if isinstance(dt, datetime.datetime):
return float(dt.toordinal())+dt.hour/24.+dt.minute/24./60.+dt.second/24./3600.
try:
return float(dt)
except:
return dt or []
def cov(A,B):
"""Covariance of 2 equal-length lists of scalars"""
ma = mean(A)
mb = mean(B)
return sum([(a-ma)*(b-mb) for a,b in zip(A,B)])/len(A)
def pearson(A,B):
"""Pearson correlation coefficient between 2 equal-length lists of scalars"""
return cov(A,B)/std(A)/std(B)
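# Sanity-check note for cov() and pearson(): they rely on mean() and std(),
# which are not defined above this point (presumably they appear later in the
# module).  cov() divides by N rather than N-1, matching SAMPLE_BIAS_COMP=0
# above, so pearson(x, x) should return 1.0 even for tiny samples,
# e.g. pearson([1., 2., 3.], [2., 4., 6.]) -> 1.0.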
def lag_correlate(rows, lead=1, verbose=True):