"""Final project: Random Writer that can save its model.
NOTE: This is a long file, however I strongly recommend you read the
whole thing BEFORE starting on the project.
In this project you will write a "random writer": a program that
builds a statistical model of some data and then outputs a stream of
data that is similar to the original but randomly generated. Here is
how we will do it: [The idea for this assignment is from an assignment
that Dr. Calvin Lin uses for CS314H. He was nice enough to let me use
his text to explain it.]

Imagine taking a book, such as Tom Sawyer, and determining the
probability with which each character occurs. You'd probably find that
spaces are the most common character, followed by the character "e",
etc. Given these probabilities, which we will call a Level 0 analysis,
you could randomly produce text that, while not resembling English,
would have the property that the characters would likely occur in the
same proportions as they do in Tom Sawyer. For example, here's what
you might produce:

rla bsht eS ststofo hhfosdsdewno oe wee h .mr ae irii ela iad o r te
ut mnyto onmalysnce, ifu en c fDwn oee

Now imagine doing a slightly more sophisticated analysis, a Level 1
analysis, that determines the probability with which each character
follows every other character. You would probably discover that "h"
follows "t" more frequently than "x" does, and you would probably
discover that a space follows a period more frequently than a comma
does. With this new analysis, you could use the probabilities from
Tom Sawyer to randomly pick an initial character and then repeatedly
choose the next character based on the previous character and the
probabilities provided by the analysis. Your new text might look like
the following, which looks a bit more like English than the previous
example:

"Shand tucthiney m?" le ollds mind Theybooue He, he s whit Pereg
lenigabo Jodind alllld ashanthe ainofevids tre lin-p asto oun
theanthadomoere

We can generalize these ideas to a Level k analysis that determines
the probability with which each character follows every possible
sequence of k characters. For example, a Level 5 analysis of Tom
Sawyer would show that "r" follows "Sawye" more frequently than any
other character. After a Level k analysis, you'd be able to produce
random text by always choosing the next character based on the
previous k characters -- which we will call the state -- and based on
the probabilities produced by your analysis.

For relatively small values of k (5-7), the randomly generated text
begins to take on many of the characteristics of the source text.
While it still will not produce legal English, you will be able to
tell that it was derived from Tom Sawyer instead of Harry Potter. As
the value of k increases, the text looks increasingly like English.
Here are some more examples:

Level 2:
"Yess been." for gothin, Tome oso; ing, in to weliss of an'te cle -
armit. Paper a comeasione, and smomenty, fropech hinticer, sid, a
was Tom, be such tied. He sis tred a youck to themen

Level 4:
en themself, Mr. Welshman, but him awoke, the balmy shore. I'll
give him that he couple overy because in the slated snuffindeed
structure's kind was rath. She said that the wound the door a fever
eyes that WITH him.

Level 6:
people had eaten, leaving. Come - didn't stand it better judgment;
His hands and bury it again, tramped herself! She'd never would
be. He found her spite of anything the one was a prime feature
sunset, and hit upon that of the forever.

Level 8:
look-a-here - I told you before, Joe. I've heard a pin drop. The
stillness was complete, how-ever, this is awful crime, beyond the
village was sufficient. He would be a good enough to get that
night, Tom and Becky.

Level 10:
you understanding that they don't come around in the cave should get
the word "beauteous" was over-fondled, and that together" and
decided that he might as we used to do - it's nobby fun. I'll learn
you."

We can also generalize this idea to use words in place of characters
as the "tokens" of the model. In fact we can generalize this process
to work over any sequence of values of any type. The states are then
short sequences of values.

Formally, the model we are building here is called a Markov Chain. A
Markov chain is a directed graph where every node is a state and every
outgoing edge from a node is a possible token to find while in that
state. The edges have probabilities that define how likely it is to
follow that edge. The probability of all the edges leaving a node
should sum to 1.

With this graph we can generate output by picking a random starting
node and then picking an outgoing edge based on the probabilities,
outputting its associated token, and repeating the process based on
the node at the other end of the edge.

More concretely, you can think of being in a state represented by the
string "th". We would then look at the probability with which each
letter follows "th": "e" is likely to be very common, "x" not so
much. If we generate the "e", we are in state "he" and we check the
probabilities based on that state.
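
As a toy sketch (with made-up counts), suppose a level-2 model saw
"e" follow the state "th" 90 times and "a" follow it 25 times, and
nothing else. The "th" node then has two outgoing edges, and the
probability of taking the edge labeled "e" is 90/115.
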
Feel free to discuss the Markov chain with one another or look it
up. However you should not discuss how to implement it or copy an
existing implementation.

NOTE: This module can be used in various modes (level, tokenization,
etc.). However, you should NOT need to reimplement large blocks of
code for each case. Instead, think about what actually changes from
mode to mode and use conditionals around only those parts. You may
even want to write some utility methods which perform some abstract
operations in the appropriate way for the current mode. In the end,
just remember how much I hate duplicated or messy code as you handle
all the cases.
"""
"""You will be turning in this assignment as a ZIP file, since you
will need to turn in 2 different modules (final.py and graph.py).
"""
"""TODO: Create a Tokenization enum with values: word, character,
byte, none (name them exactly this; do not capitalize). Use the
enumeration support in the standard library. The tokenization modes
have the following meanings:
word: Interpret the input as UTF-8 and split the input at any
white-space characters and use the strings between the white-space
as tokens. So "a b" would be ["a", "b"] as would "a\n b".
character: Interpret the input as UTF-8 and use the characters as
tokens.
byte: Read the input as raw bytes and use individual bytes as the
tokens.
none: Do not tokenize. The input must be an iterable.
"""
"""TODO: Create a module called 'graph' (make sure the file name is exactly
'graph.py') that contains class(es) and code that are used to represent and
manipulate the Markov chain graph. It should implement the Markov chain as an
abstract concept and not have any code specific to this usage. Also it should
use objects to represent the graph structure, instead of encoding relationships
as some opaque set of tables.
The API of the module is up to you, but it should be an API which I could use
to implement a different Markov-chain-related system without changing it. So
try to make the API generic and not specific to this application. I will be
grading the generality of the graph module.
NOTE: Do not optimized your graph representation. Just represent it in
the simplest way you can. That's how I implemented mine and it runs the
whole test suite in 9 seconds which is plenty fast enough for our
application and actually much faster than most implementations people
turn in.
"""
"""In implementing the algorithms required for this project you should
not focus too much on performance however you should make sure your
code can train on a large data set in a reasonable time. For instance,
a 6MB input file on level 4 should not take more than 10 seconds.
Similarly you should be able to generate output quickly; 5,000 tokens
per second or better. These should be easy to meet without any
optimization if you use the correct data structures (e.g., dicts when
you need fast lookup by a key).
You can get some test input from http://www.gutenberg.org/. You may
want to use the complete works of Shakespeare:
http://www.gutenberg.org/cache/epub/100/pg100.txt
"""

from enum import Enum, auto
import itertools
import pickle
import random
import urllib.request

import graph


class Tokenization(Enum):
    word = auto()
    character = auto()
    byte = auto()
    none = None

    @classmethod
    def is_proper_enum(cls, item):
        """Return True if item is a member of this enumeration, or None
        (which is accepted as an alias for Tokenization.none).
        """
        return isinstance(item, cls) or item is None


class RandomWriter(object):
    """A Markov chain based random data generator."""

    def __init__(self, level, tokenization=None):
        """Initialize a random writer.

        Args:
          level: The context length or "level" of the model to build.
          tokenization: A value from Tokenization. This specifies how
            the data should be tokenized.

        The value given for tokenization will affect what types of
        data are supported.
        """
        # Validate the constructor arguments.
        if level < 0:
            raise ValueError(f"level {level!r} must be non-negative")
        if not Tokenization.is_proper_enum(tokenization):
            raise ValueError(f"tokenization {tokenization!r} must be a "
                             f"Tokenization value or None")
        self._level = level
        self._mode = tokenization
        # The Markov chain itself, represented as a graph.
        self.chain = graph.Graph()
        # The total weight of all outgoing edges for each vertex, used
        # to turn raw edge weights into probabilities when generating.
        self.prob_sums = {}

    @property
    def level(self):
        """The level (context length) of this RandomWriter."""
        return self._level

    @property
    def mode(self):
        """The tokenization mode of this RandomWriter."""
        return self._mode

    def add_chain(self, data):
        """Add a new state to the Markov chain using the Graph API.

        NOTE: data should be a tuple of exactly `level` tokens, so make
        sure states are packed correctly before calling this.
        """
        # Add a new vertex to the graph representation of the chain.
        self.chain.add_vertex(data)
        # Start tracking the total outgoing edge weight for this vertex.
        self.prob_sums[data] = 0

    def has_conn(self, source, dest):
        """Return True if the source state already has an edge leading
        to the destination state.
        """
        return self.chain[source].has_edge(dest[-1])

    def get_conn(self, source, dest):
        """Return the edge that connects the source state to the
        destination state.
        """
        return self.chain[source].get_edge(dest[-1])

    def _incr_prob_sum(self, state):
        """Increment the total outgoing edge weight of a given state."""
        self.prob_sums[state] += 1

    def add_conn(self, source, dest, token):
        """Build an edge from the source state to the destination state,
        labeled with the given token.
        """
        self.chain[source].add_edge(self.chain[dest], token)

    def _choose_rand_edge(self, vertex):
        """Choose one outgoing edge of vertex at random, weighted by
        edge weight.

        Pick a random number between 1 and the total weight of the
        vertex's outgoing edges, then walk the edges in order: if the
        number is at most the current edge's weight, that edge is the
        choice; otherwise subtract the edge's weight and move on. For
        example, with two edges of weights 3 and 1, draws of 1-3 pick
        the first edge and a draw of 4 picks the second.
        """
        # randint is inclusive on both ends; starting at 1 keeps each
        # edge's chance proportional to its weight.
        rand_val = random.randint(1, self.prob_sums[vertex.data])
        # Walk the outgoing edges, consuming rand_val as we go.
        for choice in vertex.outgoing_edges:
            curr_edge = vertex.get_edge(choice)
            if rand_val <= curr_edge.weight:
                return curr_edge
            rand_val -= curr_edge.weight

    def generate(self):
        """Generate tokens using the model.

        Yield random tokens indefinitely by walking the chain, cycling
        over multiple vertices as needed.

        It is possible for generation to reach a state that has no
        outgoing edges; handle this by selecting a new starting vertex
        at random and continuing.
        """
        # Keep picking random starting states until we find one that
        # has outgoing edges.
        vertex = self.chain[random.choice(list(self.chain.vertices))]
        while not vertex.outgoing_edges:
            vertex = self.chain[random.choice(list(self.chain.vertices))]
        # Traverse the chain and generate output indefinitely.
        while True:
            # Choose an outgoing edge, weighted by its probability.
            curr_edge = self._choose_rand_edge(vertex)
            yield curr_edge.token
            # Follow the edge whose token we just yielded.
            vertex = curr_edge.dest_vertex
            # If the new vertex is a dead end, restart at random.
            while not vertex.outgoing_edges:
                vertex = self.chain[random.choice(list(self.chain.vertices))]

    def generate_file(self, filename, amount):
        """Write a file using the model.

        Args:
          filename: The name of the file to write output to.
          amount: The number of tokens to write.

        Character and byte tokens are written one after another; any
        other kind of token is followed by a space and converted with
        str. The file is opened in binary mode for byte tokenization
        and in text mode (UTF-8) otherwise.

        Do not duplicate any code from generate.
        """
        # Open the file in binary mode only for byte tokenization.
        if self.mode is Tokenization.byte:
            fi = open(filename, "wb")
        else:
            fi = open(filename, "w", encoding="utf-8")
        with fi:
            # Take only the first `amount` tokens from the generator.
            for token in itertools.islice(self.generate(), amount):
                if self.mode is Tokenization.word:
                    fi.write(token + " ")
                elif self.mode is Tokenization.none or self.mode is None:
                    fi.write(str(token) + " ")
                elif self.mode is Tokenization.byte:
                    fi.write(bytes([token]))
                else:
                    fi.write(str(token))

    def save_pickle(self, filename_or_file_object):
        """Write this model out as a Python pickle.

        Args:
          filename_or_file_object: A filename or file object to write
            to. Both must be supported.

        If the argument is a file object it is assumed to be opened in
        binary mode.
        """
        if hasattr(filename_or_file_object, "write"):
            # A writable file object: pickle straight into it.
            pickle.dump(self, filename_or_file_object)
        elif isinstance(filename_or_file_object, str):
            # A filename: open the file in binary mode for pickling.
            with open(filename_or_file_object, "wb") as fi:
                pickle.dump(self, fi)
        else:
            raise ValueError(f"{filename_or_file_object!r} is not a "
                             f"filename or file object")

    @classmethod
    def load_pickle(cls, filename_or_file_object):
        """Load a Python pickle and make sure it is in fact a model.

        Args:
          filename_or_file_object: A filename or file object to load
            from. Both must be supported.

        Returns:
          The RandomWriter instance contained in the pickle.

        If the argument is a file object it is assumed to be opened in
        binary mode.
        """
        if hasattr(filename_or_file_object, "read"):
            # A readable file object: unpickle straight from it.
            obj = pickle.load(filename_or_file_object)
        elif isinstance(filename_or_file_object, str):
            # A filename: open the file in binary mode for unpickling.
            with open(filename_or_file_object, "rb") as fi:
                obj = pickle.load(fi)
        else:
            raise ValueError(f"{filename_or_file_object!r} is not a "
                             f"filename or file object")
        # Make sure what we loaded really is a model.
        if not isinstance(obj, cls):
            raise ValueError(f"pickle did not contain a {cls.__name__}")
        return obj

    def train_url(self, url):
        """Compute the probabilities based on the data downloaded
        from url.

        Args:
          url: The URL to download. Any URL supported by urllib works.

        This method is only supported if the tokenization mode is not
        none.

        Do not duplicate any code from train_iterable.
        """
        if self.mode is Tokenization.none or self.mode is None:
            raise ValueError("training from a URL is only supported "
                             "when the tokenization mode is not none")
        # Download the data, decoding it as UTF-8 unless we are
        # tokenizing raw bytes.
        with urllib.request.urlopen(url) as f:
            if self.mode is Tokenization.byte:
                data = f.read()
            else:
                data = f.read().decode("utf-8")
        # Train on the downloaded data.
        self.train_iterable(data)

    def _gen_tokenized_data(self, data, size):
        """Yield states (tuples of `size` tokens) produced by sliding a
        window over the input data, which must be an iterable.

        If the tokenization is character or word, data is a string (or
        a list of words); if it is byte, data is a bytes object.

        NOTE: adapted from Arthur Peters' windowed() function in
        final_tests.py.

        TODO: handle the k = 0 level case.
        """
        # The window we slide over the data, one element at a time.
        window = []
        for elem in data:
            if len(window) < size:
                # The window is still filling up.
                window.append(elem)
            else:
                # Slide the window down by one unit: drop the front
                # element and append the new one at the back.
                window.pop(0)
                window.append(elem)
            # Once the window is full, yield the current state.
            if len(window) == size:
                yield tuple(window)

    def _build_markov_chain(self, states):
        """Build the Markov chain graph and compute the edge weights
        from a generator of states of length `level`.
        """
        try:
            # Pop the first state from the generator and add it to the
            # initially empty chain.
            old_state = next(states)
            self.add_chain(old_state)
        except StopIteration:
            raise RuntimeError("empty generator: the input was empty or "
                               "the conversion to an iterable went wrong")
        # Walk the remaining states, adding vertices and edges as the
        # Markov chain algorithm dictates.
        for new_state in states:
            if new_state in self.chain:
                if self.has_conn(old_state, new_state):
                    # The edge already exists; bump the occurrence count
                    # for its token.
                    self.get_conn(old_state, new_state).incr_weight(1)
                else:
                    # Create a new connection between the existing states.
                    self.add_conn(old_state, new_state, new_state[-1])
            else:
                # The next state is brand new: add it to the Markov
                # chain, then connect the old state to it.
                self.add_chain(new_state)
                self.add_conn(old_state, new_state, new_state[-1])
            # Update the old state's total outgoing weight and advance.
            self._incr_prob_sum(old_state)
            old_state = new_state

    def train_iterable(self, data):
        """Compute the probabilities based on the data given.

        Try to avoid storing all of the data at one time, but if it is
        way simpler to store it, don't worry about it. For most input
        types you will not need to store it.
        """
        # Word tokenization splits the data on whitespace first; every
        # other mode iterates over the data directly.
        if self.mode is Tokenization.word:
            states = self._gen_tokenized_data(data.split(), self.level)
        else:
            states = self._gen_tokenized_data(data, self.level)
        # Build the Markov chain from the generated states.
        self._build_markov_chain(states)
"""Modules you will want to look at:
* enum
* pickle
* requests
You may use the requests if you like even though it is not in the
standard library. I will make sure it is installed on the testing
machine.
"""