Skip to content

Latest commit

 

History

History
413 lines (383 loc) · 7.86 KB

README.md

File metadata and controls

413 lines (383 loc) · 7.86 KB

Counting Words

This example gives a stream-based implementation that counts words in a given on-line source and then sorts these words according to their frequencies.

We need a small package of the name atscntrb-hx-teaching-bucs to turn the source referred to by a URL into a linear stream of characters. This package can be downloaded by executing make npm-install or by issuing the following command-line:

npm install atscntrb-hx-teaching-bucs

We can implement a function stream_by_url_ based on one of the name stream_by_command in the downloaded package:

local

#include
"$PATSHOMELOCS\
/atscntrb-hx-teaching-bucs/mylibies.hats"

#staload
BUCS520 =
$BUCS520_2016_FALL

in (* in-of-local *)

extern
fun
stream_by_url_
(url: string): stream_vt(char)

implement
stream_by_url_(url) =
$BUCS520.stream_by_command<>
  ("wget", $list{string}("-q", "-O", "-", url))

end // end of [local]

The function stream_by_url_ calls the command wget (with some options) to fetch the source referred to by a given URL. One can of course try to implement stream_by_url_ based on the command curl as well.

We use the following type aliases in the rest of the presentation:

typedef word = string
typedef nword = (int, string)

Given a list of words ordered ascendingly (according to the standard lexicographic ordering), the following function returns a list of pairs where each pair consists of a number and a distinct word such that the number indicates the number of times the word occurring in the original given list of words:

extern
fun
list_vt_word2nword
(ws: List_vt(word)): List0_vt(nword)

Probably the most interesting function in this example is the following one that turns a linear stream of chars into a linear stream of words:

extern
fun
stream_vt_char2word
(cs: stream_vt(char)): stream_vt(word)

implement
stream_vt_char2word
  (cs) =
  auxmain(cs) where
{
  fun
  auxmain
  (
  cs: stream_vt(char)
  ) : stream_vt(word) =
  (
    case+ !cs of
    | ~stream_vt_nil() =>
       stream_vt_make_nil()
    | ~stream_vt_cons(c0, cs) =>
      (
       if isalpha(c0)
         then $ldelay
              (auxmain_con(cs, list_vt_sing(L(c0))), ~(cs))
         else auxmain(cs)
      )
  )
  
  and
  auxmain_con
  (
  cs: stream_vt(char), w0: List0_vt(char)
  ) : stream_vt_con(word) =
  (
    case+ !cs of
    | ~stream_vt_nil() =>
       stream_vt_sing(string_make_rlist_vt(w0))
    | ~stream_vt_cons(c1, cs) =>
      (
       if isalpha(c1)
         then auxmain_con(cs, list_vt_cons(L(c1), w0))
         else stream_vt_cons(string_make_rlist_vt(w0), auxmain(cs))
      )
  )
} (* end of [stream_vt_char2word] *)

Note that each word is just a non-empty sequence of letters in the English alphabet. Also, each word in the returned stream consists of only lowercase letters. Both auxmain and auxmain_con are tail-recursive, presenting no risk of stack-overflow even when they are called on a linear stream of infinite length! In general, paying close attention to addressing potential risk of stack-overflow is of great importance in constructing code of high quality.

The function stream_vt_char2nword does the work of assembling:

extern
fun
stream_vt_char2nword
  (cs: stream_vt(char)): List0_vt(nword)

implement
stream_vt_char2nword(cs) = nws where
{
  val ws = stream_vt_char2word(cs)
  val ws = stream2list_vt<word>(ws)
  val ws = list_vt_mergesort_fun<word>(ws, lam(w1, w2) => compare(w1, w2))
  val nws = list_vt_word2nword(ws)
  val nws = list_vt_mergesort_fun<nword>(nws, lam(nw1, nw2) => ~compare(nw1.0, nw2.0))
}

The code implementing stream_vt_char2nword is self-explanatory.

When the default URL is used, the execution of the program in this example outputs the following table that lists the first 250 most frequently used words in the novel Moby-Dick by Herman Melville:

1	the -> 14715
2	of -> 6742
3	and -> 6517
4	a -> 4805
5	to -> 4707
6	in -> 4241
7	that -> 3100
8	it -> 2536
9	his -> 2532
10	i -> 2127
11	he -> 1900
12	s -> 1825
13	but -> 1823
14	with -> 1770
15	as -> 1753
16	is -> 1751
17	was -> 1646
18	for -> 1644
19	all -> 1545
20	this -> 1443
21	at -> 1335
22	whale -> 1245
23	by -> 1227
24	not -> 1173
25	from -> 1105
26	on -> 1073
27	him -> 1069
28	so -> 1066
29	be -> 1064
30	you -> 964
31	one -> 925
32	there -> 871
33	or -> 798
34	now -> 786
35	had -> 779
36	have -> 774
37	were -> 683
38	they -> 670
39	which -> 655
40	like -> 647
41	me -> 633
42	then -> 631
43	their -> 620
44	are -> 619
45	some -> 619
46	what -> 619
47	when -> 607
48	an -> 600
49	no -> 596
50	my -> 589
51	upon -> 568
52	out -> 539
53	man -> 530
54	up -> 526
55	into -> 523
56	ship -> 519
57	ahab -> 517
58	more -> 509
59	if -> 501
60	them -> 474
61	ye -> 473
62	we -> 469
63	sea -> 455
64	old -> 452
65	would -> 432
66	other -> 431
67	been -> 415
68	over -> 410
69	these -> 406
70	will -> 399
71	though -> 384
72	its -> 382
73	down -> 379
74	only -> 378
75	such -> 376
76	who -> 366
77	any -> 364
78	head -> 348
79	yet -> 345
80	boat -> 337
81	long -> 334
82	time -> 334
83	her -> 332
84	captain -> 329
85	do -> 324
86	here -> 324
87	very -> 323
88	about -> 318
89	still -> 312
90	than -> 311
91	chapter -> 308
92	great -> 307
93	those -> 307
94	said -> 305
95	before -> 301
96	two -> 298
97	has -> 294
98	must -> 293
99	t -> 291
100	most -> 285
101	seemed -> 283
102	white -> 281
103	last -> 278
104	see -> 275
105	way -> 273
106	whales -> 272
107	thou -> 271
108	after -> 270
109	again -> 263
110	stubb -> 261
111	how -> 259
112	did -> 258
113	your -> 258
114	may -> 255
115	queequeg -> 253
116	little -> 249
117	can -> 247
118	round -> 247
119	while -> 246
120	sperm -> 245
121	three -> 245
122	men -> 244
123	say -> 244
124	first -> 239
125	through -> 235
126	us -> 234
127	every -> 232
128	well -> 230
129	being -> 225
130	much -> 224
131	where -> 223
132	off -> 220
133	could -> 217
134	good -> 216
135	hand -> 215
136	same -> 215
137	our -> 211
138	side -> 208
139	ever -> 206
140	never -> 206
141	himself -> 205
142	look -> 205
143	own -> 205
144	deck -> 199
145	starbuck -> 199
146	almost -> 197
147	go -> 194
148	even -> 193
149	water -> 190
150	thing -> 188
151	away -> 186
152	should -> 185
153	too -> 185
154	might -> 183
155	come -> 180
156	day -> 179
157	made -> 178
158	pequod -> 178
159	life -> 176
160	world -> 176
161	sir -> 175
162	fish -> 171
163	many -> 168
164	among -> 167
165	far -> 165
166	seen -> 165
167	back -> 164
168	without -> 164
169	line -> 160
170	let -> 158
171	oh -> 157
172	right -> 157
173	cried -> 156
174	eyes -> 156
175	nor -> 156
176	aye -> 155
177	god -> 153
178	know -> 153
179	part -> 153
180	night -> 152
181	sort -> 152
182	thought -> 150
183	once -> 149
184	boats -> 147
185	air -> 143
186	crew -> 141
187	don -> 140
188	take -> 137
189	whole -> 137
190	full -> 136
191	half -> 136
192	against -> 135
193	tell -> 135
194	things -> 134
195	thus -> 134
196	whaling -> 133
197	thee -> 131
198	came -> 130
199	hands -> 130
200	mast -> 130
201	small -> 130
202	soon -> 130
203	each -> 129
204	feet -> 127
205	both -> 126
206	under -> 126
207	something -> 123
208	till -> 123
209	think -> 122
210	between -> 120
211	she -> 120
212	why -> 119
213	found -> 118
214	just -> 117
215	place -> 117
216	called -> 116
217	saw -> 116
218	another -> 115
219	ll -> 115
220	make -> 115
221	nothing -> 115
222	towards -> 115
223	poor -> 114
224	thy -> 113
225	times -> 112
226	along -> 110
227	body -> 110
228	heard -> 110
229	work -> 110
230	flask -> 109
231	high -> 108
232	stand -> 107
233	moment -> 105
234	sight -> 105
235	end -> 103
236	voyage -> 103
237	new -> 102
238	sail -> 102
239	sun -> 102
240	hold -> 99
241	shall -> 99
242	does -> 98
243	strange -> 98
244	nantucket -> 97
245	went -> 97
246	years -> 97
247	however -> 96
248	leviathan -> 96
249	face -> 95
250	few -> 95

It is not surprising to see the word whale as the first noun in the list (word#22): The novel is all about whales and whaling!

Happy programming in ATS!!!