Rosie Rust Crate Questions:
Section 1, Build & Linking
Q-01.01: <<<<----CLOSED, Fixed---->>>> Does static-linking librosie into a compiled executable make any sense?
The intention here is to reduce deployment complexity for client apps, since most systems are unlikely to have librosie
already installed. However, if Rosie itself has many other dependencies or if there is no easy way to bundle the
standard pattern library then static linking may not accomplish anything.
Jamie said:
We used to build librosie.a and it needs only libc. No one was using it, so
it dropped out of the Makefile the last time we reworked the build process.
As you point out, installing a static lib is not sufficient because you also
need the compiled lua files and the standard pattern library, too.
Btw, there is a subtlety wrt the standard pattern library. In theory you
don't need it. It's convenient, sure. And it's architecture and OS
independent, and the RPL files do not undergo any "build" step -- they are
just "pattern source". In reality, Rosie is self-hosting in the sense that
the syntax for RPL is defined in rpl/rosie/rpl_1_3.rpl, and this file is
read when rosie (librosie) starts up.
The (lua) code src/lua/parse_core.lua can parse a restricted subset of RPL,
and rpl/rosie/rpl_1_3.rpl is written in that.
We could easily eliminate the run-time dependency on rpl/rosie/rpl_1_3.rpl.
We should build the RPL parser once, at build time, and bake it into
librosie as the default RPL parser. This should improve startup time, too,
although no one has complained about that yet. Probably because it's truly
a start-up cost in that it happens once, not once per engine. Engine
creation is really cheap.
To get the most benefit from a static library, I'd probably pursue:
- Link a librosie.a
- Use an existing tool to package all the compiled lua files into a system
binary, and link that into librosie.a
- Generate the (Rosie vm) bytecode for rpl/rosie/rpl_1_3.rpl at build time
- Incorporate that bytecode into librosie.a also, so it's not a separate file
- The result should be a single librosie.a that can be used by itself, or
optionally with any library of useful patterns
Luke said:
We'll revisit this when the rosie build produces a self-contained librosie.a that is appropriate for static linking.
Luke said (2021-10-21):
rosie-sys crate builds Rosie from source. Issue Resolved.
Q-01.02: <<<<----CLOSED, Fixed---->>>> What is the best way to build librosie from within Cargo (the rust code
package manager)? My intention here is to provide a more streamlined experience for Rust developers using the Rosie
crate. It appears there are several options:
Option 1: Include the rosie source inside the cargo crate. Zipped up, rosie-v1.2.2 source is ~8MB, which isn't too bad,
but then the build process would need to unpack it, run `make fetch` to pull in the additional dependencies, and
finally build it. This means the build process relies on the (gitlab?) server being up, as well as bloating the
crate itself. A compromise featuring the bad parts from both options, but perhaps the simplest.
Option 2: Pull the rosie source from a server. This saves the crate bloat, but still relies on the rosie source server
being awake. If this option is preferred, what is the best server & protocol to use to fetch the initial Rosie
source? `git clone` ends up downloading a bunch of junk that's not needed for a minimal build-from-source.
Option 3: Is there an option 3?
Luke said: (paraphrasing Kornel)
Kornel believes that a Rust Crate should contain its own source, and that means a sys crate should contain the
source for the library it links. The cargo crate should not install in a shared location, and keep all of its
build products in the temporary cargo directory. In addition, the cargo build process should not require a
network connection.
Kornel's reference guide to creating sys crates: https://kornel.ski/rust-sys-crate
Luke said:
Unfortunately, rosie's Makefile requires the following packages on top of vanilla ubuntu: libreadline-dev,
libbsd-dev, meaning it's not straightforward to build from source in cargo.
I haven't investigated what these libraries do inside of librosie, but I have a suspicion libreadline is for the
repl shell, so perhaps the best course of action is to refactor the rosie build itself to remove that dependency
from a "library-only" build. Perhaps the stdio functionality from libbsd could be wrapped at compile-time with
a shim to make it use the platform-native io functions, rather than requiring the libbsd library to be installed.
Kornel's advice in such a situation is to package each C library dependency in its own -sys crate so Cargo
can build the whole thing. That sounds like more side-work than I want to take on right now. Especially if the
librosie build might change to eliminate those dependencies.
Punting the build-librosie-from-source-within-cargo feature to the future.
Jamie said:
If the requirement is to build from source, then include a full zip of the
Rosie source along with its submodules (lua, cjson, ...) is probably the
simplest approach.
An odd idea, perhaps worth considering, is to package Rosie in a crate
WITHOUT the Rust interface -- which I know sounds bizarre. But if there
were two crates, then you could express version dependencies between them,
right? So a person could use one crate merely as a convenient way to
install Rosie, and to upgrade it independently of the Rust interface which
can then evolve at its own pace.
Luke said:
Jamie & Kornel appear to agree on most points. Jamie independently has intuited the
correct function of -sys crates, unlike Luke's abominable usage of the term.
The plan of action will be:
-rename the current crate from "rosie-sys" to "rosie-rs".
-create a *REAL* -sys crate to build librosie from source, at a future time, perhaps
when librosie can eliminate the dependencies on libreadline and libbsd.
Luke said (2021-10-21):
rosie-sys crate builds Rosie from source or links existing librosie. Building from source excludes
CLI, and thus has no external dependency on libreadline. (Forgot to verify libbsd)
Q-01.03: <<<<----CLOSED, Nothing to do---->>>> How does the pip installer handle the above situation? I noticed
`pip install rosie` doesn't result in a shared library ending up anywhere in the link path. Perhaps I should copy
the pip approach, although Python is less suited to building executables designed to be deployed in binary form
(vs. Rust), so perhaps the Python approach won't suit us here.
Jamie said:
The Python module for using Rosie simply requires the user to first install
Rosie. I think this is probably the right way for all the language modules
to work, but it does mean that we have to do better at packaging.
... Pip only installs the Python part; the user must first install
Rosie, which provides librosie as well as the rosie (CLI) binary.
Rust devs may have different requirements from what is provided by the
"usual" download and install process. Perhaps some of those can be
satisfied by the Rosie project itself? E.g. we could build a static
librosie.a; we could make it possible to `apt install rosie` or `yum install
rosie`.
Luke said:
I missed that pip wasn't installing librosie. I just assumed it was installing it somewhere hidden. I
tested `pip install rosie` on a system with no librosie present and the install succeeded. I didn't attempt
to use rosie within python. <--Forehead is sore now-->
Q-01.04: <<<<----CLOSED, Nothing to do---->>>> I would like to have a crate version that references the
librosie version, but I want a minor-digit to allow for a revision to the rust crate without a revision
to librosie, and unfortunately Cargo only supports 3-tuple versions. What would be the least-bad option
for versioning the rust crates?
Jamie said:
This is a tough one. While we've been trying to maintain a "Rosie version"
as a major/minor/patch tuple, it's been challenging because sometimes a
change is only to the CLI. Users of librosie may see literally no change at
all in some cases. (And internally we maintain an "RPL version", currently
at 1.3, which tracks language changes. There have been only additions, no
feature changes/removal, so the major is still at 1 and should stay there
for a long time.)
Suppose Rosie were packaged separately from the Rust interface. What would
you need then regarding versioning? When loading librosie.so, you'd need to
check to make sure it was within some acceptable version range. So we need
a version API. When statically linking with librosie.a, you'd want a
compile-time check for the same, right? In C, you'd include librosie.h,
which should provide a version number (but probably does not, currently).
I'd like to explore this further.
Luke said:
A real -sys crate could track the librosie version in lockstep. Then the -rs "rust interface" crate
would get its own independent version and could use the cargo dependency mechanism appropriately.
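For illustration, the version-range check Jamie describes could look roughly like the Rust sketch below. It assumes librosie exposes its version as a "major.minor.patch" string, which per the discussion above it does not currently do; `parse_version` and `is_compatible` are hypothetical helper names, not part of any crate.

```rust
/// Parse a "major.minor.patch" string into a tuple.
/// Returns None if a part is missing, extra, or not a number.
fn parse_version(s: &str) -> Option<(u32, u32, u32)> {
    let mut parts = s.trim().split('.').map(|p| p.parse::<u32>().ok());
    let major = parts.next()??;
    let minor = parts.next()??;
    let patch = parts.next()??;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// True when `found` lies in the half-open range [min, below).
/// Tuples compare lexicographically, which matches version ordering.
fn is_compatible(found: (u32, u32, u32), min: (u32, u32, u32), below: (u32, u32, u32)) -> bool {
    found >= min && found < below
}
```

A dynamically loading client would run this check once after loading the library; a statically linked build could perform the same comparison at build time instead.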
Q-01.05: <<<<----PUNTED to future---->>>> What is the name / server for the rosie ubuntu package?
Jamie said:
If it's acceptable to impose "first install Rosie" on the Rust programmer,
then we could look at platform-based package managers (as opposed to
language-based). `brew install rosie` works if (1) you use brew, and (2)
you add an additional "tap"
(https://gitlab.com/rosie-community/packages/homebrew-rosie).
Today, brew builds from source, but I've experimented with their "bottle"
feature which installs pre-built binaries, and it looks straightforward.
A volunteer who could construct a package for apt and for the thing that
replaced yum would be very much appreciated in this regard. I would buy
them a beer or other sustenance.
Luke said:
Facepalm. I guess in all the installing and uninstalling of rosie I was doing, I forgot I never
did use `apt-get install` to install rosie.
Unfortunately I don't think I'm the right person to create the apt package. At least not right now.
Q-01.06: <<<<----OPEN, follow-up questions---->>>> The Deployment of rosie leaves a lot to be desired, in that
it requires the rosie lua files and standard pattern library to be placed somewhere on the install disk and
then it requires the library to be initialized with the appropriate path. A better solution would be to
embed a default set of lua scripts and patterns into the compiled binary somehow.
One option is to use zlib to create a blob from the filesystem directory, and read from that blob. Perhaps
a better option is to systematically address every place the lua files are absolutely required and replace
that with C functionality, so the Lua scripts are optional. I don't know if that is realistic.
Section 2, Pedantic Memory Safety Concerns
Disclaimer: many questions in this section have obvious answers, but Rust teaches a person to be a stickler for memory
correctness, so in many cases, I'm not necessarily asking about what LibRosie currently does, but instead I'm asking
whether a contract exists such that no future change to librosie will change that behavior.
(But also, in other cases, I didn't fully trace the code all the way through, so please pardon the ignorance on my part.)
Q-02.01: <<<<----CLOSED, Nothing to do---->>>> I assume that calling `rosie_libpath` with a non-null string pointer
will always result in the engine fully-ingesting the path data, such that it will always be safe to free the string
upon the return from `rosie_libpath`. Correct?
Jamie said:
Yes. The lua_pushlstring() copies the string for us. I should document this.
Q-02.02: <<<<----CLOSED, Nothing to do---->>>> Same question as above, but for `rosie_compile` and the `expression`
argument. I assume the client can free the buffer containing the expression upon returning from `rosie_compile`.
Correct?
Jamie said:
Yes. The lua_pushlstring() copies the string for us. I should document this.
Q-02.03: <<<<----CLOSED, Nothing to do---->>>> Same question as above, but for `rosie_match` and the `input` argument.
Can I safely assume that I can free the `input` buffer upon this function returning, and neither the engine nor the
match_result has taken any pointers into the data from the `input` buffer?
Jamie said:
Yes. The librosie functions do not keep any references to the `input`
buffer. And currently, we do not modify the input buffer at all. There's
an optimization we're considering for the future in which we would modify
the `input`, btw, but I think what we should do is create an alternative
rosie_match API that you can call if you're ok with us mangling the input
buffer. (We can, in that case, restore it on return to the caller, if the
caller needs that, but meanwhile thread safety has left the building.)
Luke said:
As long as the alternate API has a different entry point, Rust can happily work with that 'modifying' API.
In fact, it will even enforce that no other threads are relying on the memory not changing while the
call is in-flight. Rust is careful like that.
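A minimal sketch of how the two entry points might surface in Rust. The functions `match_readonly` and `match_in_place` are hypothetical stand-ins, not librosie or crate APIs, and their bodies only mimic "leave the input alone" versus "mangle, then restore".

```rust
/// Hypothetical stand-in for today's behaviour: the input is only read,
/// so a shared borrow is enough and other readers may coexist.
fn match_readonly(input: &[u8]) -> bool {
    input.starts_with(b"127.")
}

/// Hypothetical stand-in for a future "mangling" entry point. Requiring
/// `&mut [u8]` makes the compiler prove that nothing else can observe the
/// buffer while the call is in flight.
fn match_in_place(input: &mut [u8]) -> bool {
    if input.is_empty() {
        return false;
    }
    let saved = input[0];
    input[0] = 0; // temporarily overwrite, as the optimization might
    let matched = saved == b'1';
    input[0] = saved; // restore before returning to the caller
    matched
}
```

The difference in the parameter type is what carries the guarantee: a caller holding only a shared reference to the data cannot call the second function at all.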
Q-02.04: <<<<----CLOSED, Nothing to do---->>>> Same question, but for `rosie_trace` and the `input` argument. I want
to be sure the returned `trace` buffer will never be pointing back into the `input` buffer, nor will the engine
retain any pointers into memory owned by `input`.
Jamie said:
Right. The `rosie_trace` code makes little attempt to be efficient. It
copies the input buffer (using lua_pushlstring()) before it gets started,
and it returns a newly allocated copy of `trace` when it's done.
Details about `trace` buffer: When tracing succeeds, it returns a string
containing a representation of what happened during matching. At
librosie.c:928 we obtain a pointer to that string using lua_tolstring().
The string itself is managed by Lua (which will gc it eventually), so we use
rosie_new_string() to make a copy of it to return to the caller. The caller
is then responsible for freeing the `trace` return value if the call to
rosie_trace succeeded.
On reflection, perhaps a better approach here would be to ask the caller to
free `trace` if it's non-null? This situation could be problematic: Caller
to `rosie_trace` provides non-null `trace` pointer (could be uninitialized
or worse, the only pointer to an allocation); the call to rosie_trace fails,
in which case the `trace` arg has the same value upon return. The caller
should NOT free `trace` in this case.
Luke said:
For the Rust crate, it's a non-issue because we always pass a NULL ptr into rosie_trace(),
and the cleanup code is smart about (not) deallocating null pointers when finished. All of
that stuff is abstracted away from the user of the Rust crate anyway, and it's isolated as
part of the interface where Rust meets C.
In general, however, asking the user to free the pointer conditionally, depending on whether
the trace was successful might be called a "foot-gun" on the Rust message board that I
frequent. Since the trace pointer isn't optional, I think you'd be within your rights to
NULL it out as well, to cover the case where the call fails.
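The "always pass NULL, free only what comes back non-null" approach can be sketched as follows. `fake_rosie_trace` is a hypothetical stand-in that imitates the behaviour described above (allocate on success, leave the pointer untouched on failure); it is not the real C function, and a `Box<String>` stands in for the C allocation.

```rust
use std::ptr;

/// Hypothetical stand-in for the C side: on success it stores a fresh
/// allocation in `*trace`; on failure it leaves `*trace` untouched.
fn fake_rosie_trace(succeed: bool, trace: &mut *mut String) -> i32 {
    if succeed {
        *trace = Box::into_raw(Box::new(String::from("trace output")));
        0
    } else {
        -1
    }
}

/// Start from NULL so that "non-null after the call" can only mean
/// "the callee allocated this and the caller now owns it".
fn trace_to_string(succeed: bool) -> Option<String> {
    let mut trace: *mut String = ptr::null_mut();
    let _status = fake_rosie_trace(succeed, &mut trace);
    if trace.is_null() {
        return None;
    }
    // SAFETY: the pointer came from Box::into_raw above and is reclaimed
    // exactly once here.
    let boxed = unsafe { Box::from_raw(trace) };
    Some(*boxed)
}
```

Because the pointer never holds a caller-owned value on entry, the failure case cannot lead to a bad free.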
Q-02.05: <<<<----CLOSED, Nothing to do---->>>> Not really a question, but there appears to be a bug in the
implementation of rosie_compile. *pat is set to 0 (line 590) before pat is checked against NULL (line 598),
meaning that if the arg were null then the code already would have crashed before the check. So the check
appears to be pointless unless I've misread something.
Jamie said:
Good catch, thanks! Fix staged for next patch release. I see that I could
be more defensive here about the engine parameter as well, which if null
will cause a crash.
Q-02.06: <<<<----CLOSED, Fixed---->>>> How does the life-cycle of the `rosie_match` argument's `match.data`
member work? I see that the engine owns the memory so the client shouldn't free it. But how long can the
client depend on the pointer being valid? Does anything inside the engine cause these pointers to be freed
or re-used? Or does the engine just keep accreting match result buffers until the engine itself is freed with
`rosie_finalize`?
Luke said: (paraphrasing Jamie)
Answer: The buffer is reused with each call to rosie_match. This design choice was made to reduce malloc / free
overhead in situations where rosie_match is called repeatedly in a loop.
In the future, an API that allows the buffer to be retained by the client may be advantageous for performance
in situations where the client wants to keep multiple match result buffers around without copying them. This
is not the common use case however.
Luke said:
For the Rust interface, this means the RawMatchResult structure will take a mutable borrow of the engine,
preventing any engine access while the RawMatchResult is alive.
Updated RosieEngine::match_pattern_raw() so it takes a mutable borrow of the engine, and then creates the
RawMatchResult with the lifetime of that borrow.
Luke said (2021-10-21):
Update: With the addition of the rosie_match_2 api, the engines now maintain a separate buffer for each pattern.
This means I can implement (have implemented) the "singleton engines" where I allow the engine to be hidden
from the user. Now the RawMatchResult struct takes a mutable borrow of the Pattern object to make sure that
pattern can't be used for future matching while the RawMatchResults are alive.
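The borrow relationship can be illustrated with a simplified sketch. The `Engine` and `RawMatchResult` types here are toy stand-ins written for this example (the "match" just copies the input into the reused buffer); they are assumptions, not the crate's actual definitions.

```rust
/// Toy engine that reuses one result buffer across matches, mirroring the
/// buffer-reuse behaviour described above.
struct Engine {
    buffer: Vec<u8>,
}

/// Borrowed view of the engine-owned buffer. The lifetime ties it to the
/// `&mut Engine` borrow that produced it.
struct RawMatchResult<'a> {
    data: &'a [u8],
}

impl Engine {
    fn new() -> Self {
        Engine { buffer: Vec::new() }
    }

    /// Overwrites the shared buffer, then hands out a view of it.
    fn match_pattern_raw(&mut self, input: &str) -> RawMatchResult<'_> {
        self.buffer.clear();
        self.buffer.extend_from_slice(input.as_bytes());
        RawMatchResult { data: &self.buffer }
    }
}

impl RawMatchResult<'_> {
    fn as_bytes(&self) -> &[u8] {
        self.data
    }
}
```

Holding one `RawMatchResult` while calling `match_pattern_raw` again on the same engine is rejected at compile time, which is exactly the protection needed when the underlying buffer is reused. The same shape applies when the borrow is taken on a pattern object instead of the engine.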
Q-02.07: <<<<----CLOSED, Nothing to do---->>>> The code for `rosie_load` says "N.B. Client must free 'messages' ",
but I spotted a few places where messages was set using `rosie_new_string_from_const`, which means the pointer
points to a static, and shouldn't be freed. However, in the common case, the ptr gets its value from
`rosie_new_string`, which does perform a malloc(). This issue exists in several places outside of `rosie_load`
as well.
Jamie said:
This is a situation that either needs clarification in the Rosie API doc
(which barely exists:
https://gitlab.com/rosie-pattern-language/rosie/-/blob/master/doc/librosie.md)
or I need to change the "protocol".
Currently, the functions that accept *messages will stomp over whatever
value this pointer had on entry. A new allocation will be made if the
function generates any messages. The protocol is (should be) that the API
ignores the value of *messages on entry, and will either set *messages to
NULL on exit or set *messages to a new allocation containing messages. If
the caller sees non-null *messages returned, they now own that memory.
This is a common convention in some C code, but may not mesh well with
Rust. Of particular concern, I think, is that code calling rosie_load and
other APIs may expect rosie_load and others to reuse *messages. As you
note, it will not. The value for *messages supplied by the caller on entry
to load/loadfile will be ignored. A new value for *messages is (should be)
always set by load/loadfile before returning. That new value may be NULL
(no messages).
I'm interested to know what would work better wrt Rust?
Luke said:
Ha! I didn't even notice the librosie docs file!
https://gitlab.com/rosie-pattern-language/rosie/-/blob/master/doc/librosie.md
I would have asked fewer dumb questions if I had read that. :-/
Your explanation of messages makes perfect sense, and that's essentially what I understood
already from the docs and code that I did read. I must have been really tired when I wrote the
question because I now see that `rosie_new_string_from_const()` calls into `rosie_new_string()`,
so the original issue is moot.
For some reason, I thought it called into `rosie_string_from()`, perhaps because
`rosie_string_from` is implemented right below.
Generally, WRT Rust, Rust is capable of being a low-level language so it's possible to get it to do
whatever C / C++ can do with very few (if any) exceptions. But, philosophically, I think of Rust as
a way to validate the "shape" of an api. I find that the things that are friendly in Rust generally
map to any language, while the things that Rust fights you about might be a bad idea in general.
The one exception is that rust lets the compiler decide whether some objects are heap-allocated or
stack-allocated, while C & C++ force that decision into the code.
Q-02.08: <<<<----CLOSED, Nothing to do---->>>> The comments above `rosie_load` & `rosie_loadfile` make no
mention of the client needing to free pkgname. However, looking inside the function implementations, it
appears that pkgname is allocated with rosie_new_string, and not retained inside the engine, therefore,
it appears that the caller should also be responsible for deallocating 'pkgname'. Did I miss something?
Jamie said:
This is a bug in the documentation. The caller must free `pkgname` if it
comes back non-null but ONLY if the call to load/loadfile succeeded. By
contrast, the `messages` pointer is always set by load/loadfile and if it's
non-null on return to the caller, the caller must free it. This is an
inconsistency in the way the two args are handled.
Luke said:
That's already what the Rust code does. :-)
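The asymmetric ownership rules for `pkgname` and `messages` can be written down as a small decision function. This is only a sketch of the rule as stated by Jamie above; `LoadOutcome`, `MustFree` and `must_free` are hypothetical names invented for the example.

```rust
/// What came back from a load/loadfile call, reduced to the three facts
/// that decide ownership.
struct LoadOutcome {
    call_succeeded: bool,
    pkgname_is_null: bool,
    messages_is_null: bool,
}

/// Which returned pointers the caller now owns and must free.
#[derive(Debug, PartialEq)]
struct MustFree {
    pkgname: bool,
    messages: bool,
}

fn must_free(outcome: &LoadOutcome) -> MustFree {
    MustFree {
        // pkgname: free only when the call itself succeeded AND it is non-null.
        pkgname: outcome.call_succeeded && !outcome.pkgname_is_null,
        // messages: always set by the callee, so non-null means "yours now".
        messages: !outcome.messages_is_null,
    }
}
```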
Section 3, Behavior Clarifications
Q-03.01: <<<<----CLOSED, Fixed---->>>> Why does librosie return 0 (SUCCESS) in certain failure situations?
Am I misunderstanding the purpose of the error result code? For example, I get success for:
* An invalid pattern syntax sent to `rosie_compile`
* text that fails to parse, sent to `rosie_load`
* The package specified to `rosie_import` doesn't exist
* The package.rpl file specified to `rosie_import` has a syntax error
* An invalid file path or other file system err (e.g. no access), in `rosie_loadfile`
* Syntax errors in the rpl file, opened with `rosie_loadfile`
Jamie said:
The "first level" of return value indicates whether the API call succeeded
at a very basic level, which should correspond to "no internal errors were
encountered".
Our goal is: If you get the API to return a failure code, then either you
violated the protocol for using that function (e.g. by supplying NULL where
it's not allowed) or there's a bug in Rosie. The first should be easy to
rule out, though it may require logging to be enabled (which is currently a
compile time flag, alas). Once any usage issues are ruled out, the bug is
ours and should be reported.
Luke said:
Ok. I've taken the liberty of adding a few additional errors to the Rust `RosieError` enum:
RosieError::PatternError is any success code from rosie_compile, that still results in an
invalid pattern.
RosieError::PackageError is any success code from rosie_load, rosie_loadfile or rosie_import
that results in a NULL package name or an error status in the "ok" parameter.
In a perfect world there would be a way to differentiate an rpl syntax error from some of the
other error conditions (such as missing file) without parsing the JSON messages (I'm trying to
keep the JSON parser dependency out of the rust crate) but that's a minor nit.
Regarding the "ok" parameter, I think there is a documentation bug. The librosie docs say:
"If ok is non-zero, an error occurred, and messages will contain a JSON-encoded error structure."
Empirically, however, the value appears to be a bool represented as an int, so therefore, non-zero
is success.
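A rough sketch of how the two-level result (API status, then pattern or package status) could map onto a Rust `Result`. `PatternError` and `PackageError` come from the description above; the `ApiFailure` variant, the helper names, and the convention that a pattern handle of 0 means "no pattern" are assumptions made for this example. The `ok` flag follows the empirical reading above (non-zero is success).

```rust
#[derive(Debug, PartialEq)]
enum RosieError {
    /// Hypothetical catch-all: librosie itself returned a non-zero status.
    ApiFailure(i32),
    /// The compile call returned success but produced no usable pattern.
    PatternError,
    /// load/loadfile/import returned success, but the package did not load.
    PackageError,
}

/// `status` is the C return code (0 = success); `pat` is the pattern handle,
/// where 0 is assumed to mean "invalid pattern".
fn check_compile(status: i32, pat: i32) -> Result<i32, RosieError> {
    if status != 0 {
        return Err(RosieError::ApiFailure(status));
    }
    if pat == 0 {
        return Err(RosieError::PatternError);
    }
    Ok(pat)
}

/// `ok` is treated as a bool stored in an int: non-zero means success.
fn check_load(status: i32, ok: i32, pkgname: Option<&str>) -> Result<String, RosieError> {
    if status != 0 {
        return Err(RosieError::ApiFailure(status));
    }
    match pkgname {
        Some(name) if ok != 0 => Ok(name.to_string()),
        _ => Err(RosieError::PackageError),
    }
}
```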
Q-03.02: <<<<----CLOSED, Fixed---->>>> What is the nicest way, in your opinion, to communicate a "no match" from the
Rust equivalent of `rosie_match`? As you know, `rosie_match` returns SUCCESS, but a NULL pointer in the match
result data. In Rust, NULL pointers are not a thing, so I thought I'd create a "NoMatch" error code. But
"NoMatch" isn't really an error in the same way that other errors are errors. On the other hand, I don't want
to bloat the function with another argument. So, a "NoMatch" error is the cleanest interface, as long as it's
conceptually ok.
Jamie said:
Ah, this is an interesting design question. In Python, we throw an
exception if the API call fails. If the API call succeeds, then Python can
return NULL (for no match) or a match data structure. In Go, we return an
error status and a value, so the error status takes the place of the
exception, and it can indicate "no error" while the value is NULL to mean
"no match". I don't know enough about Rust to make a recommendation here.
Luke said:
I think I've decided that "no match" is actually something the MatchResult object should be able to represent.
Since the object is a black box, this doesn't complicate the interface at all. Now, both MatchResult and
RawMatchResult have a "did_match()" method that returns a bool.
Luke said (2021-10-21):
Update: the match call (now called match_str) is capable of returning a number of types, one of which is a
bool. In the context of returning a bool, the function gets to skip the work of encoding the MatchResults.
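One way such a "return whichever type the caller asks for" call can be structured is with a small trait, sketched below. The trait `MatchOutput`, the toy prefix-matching logic, and this `match_str` signature are all assumptions for illustration and do not reflect the crate's real definitions; only the `did_match()` idea is taken from the text above.

```rust
/// Toy match result that can also represent "no match".
struct MatchResult {
    matched: Option<String>,
}

impl MatchResult {
    fn did_match(&self) -> bool {
        self.matched.is_some()
    }

    fn matched_text(&self) -> Option<&str> {
        self.matched.as_deref()
    }
}

/// Hypothetical trait: each output type decides how much work to do.
trait MatchOutput {
    fn from_raw(raw: Option<&str>) -> Self;
}

impl MatchOutput for bool {
    // Cheapest path: no result encoding at all.
    fn from_raw(raw: Option<&str>) -> Self {
        raw.is_some()
    }
}

impl MatchOutput for MatchResult {
    fn from_raw(raw: Option<&str>) -> Self {
        MatchResult { matched: raw.map(str::to_owned) }
    }
}

/// Toy matcher: the "pattern" is a literal prefix.
fn match_str<T: MatchOutput>(literal: &str, input: &str) -> T {
    let raw = if input.starts_with(literal) { Some(literal) } else { None };
    T::from_raw(raw)
}
```

The caller selects the output with a type annotation, for example `let hit: bool = match_str("ab", "abc");`.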
Q-03.03: <<<<----CLOSED, Nothing to do---->>>> What are the situations where a valid "messages" string is returned
along-side a successful result? I noticed a comment saying this could happen, but I have never seen it. The
reason I ask is that I can roll the messages argument inside the return error code, and simplify the API. But
it will mean there will be no way for the caller to get the "messages" if the function successfully provided what
it was invoked to provide.
Jamie said:
I don't think this can happen with the current code, because we don't issue
any compiler warnings today, only errors. But we'd like to add warnings and
also the occasional informational message, so we planned for this (perhaps
prematurely).
You have a choice as an interface designer to make this simplification
(combining the messages and error code). It will not break anything. A
future version of librosie may cause you to rework your interface so that
the Rosie user can get compilation warnings or info, but that's not an issue
today.
Luke said:
I see that warnings are a worthwhile reason to keep this messages channel around in a success case.
I think the solution for simplifying the interface is to provide higher-level calls, not to remove
functionality (even theoretical future functionality) from the low-level calls. So I'll leave the
interface as it is.
Q-03.04: <<<<----CLOSED, Nothing to do---->>>> How important is the `rosie_matchfile` entry point for a Rust-based API?
On the "pro" side, by handing the file IO operations to librosie, librosie can presumably do a better job streaming
the accesses than a naive implementation that read the whole input file into a buffer and called `rosie_match`.
On the "con" side, it seems that there is little to be gained by calling this directly from Rust through a native
interface, versus just invoking the `rosie` cmd-line tool. Am I misunderstanding the purpose of this entry point?
Jamie said:
This entry point is purely for programmer convenience, and it does seem
reasonable to not support it in some language interfaces to librosie. The
`matchfile` API makes it easy to write a new CLI, which is a special use
case. And it can also speed things up for some languages when a programmer
happens to want exactly the functionality it provides and no more -- with
the speed up coming from not having to marshal strings over to C and back
for every line of a file. If Rust is able to pass librosie a
pointer+length, i.e. without having to copy the Rust string just for
librosie, then the Rust developer has no need for `matchfile`.
Luke said:
All Rosie Rust interfaces use zero-copy when possible (I know that's a tautology). But it's not
a tautology to say all the Rosie Rust interfaces use zero-copy everywhere Rosie permits it.
It's decided then. The Rust interface will not call `rosie_matchfile`.
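A minimal sketch of the pointer+length hand-off Jamie mentions. The `RawStr` struct, its field names and layout, and `borrow_as_raw` are assumptions for illustration; they are not librosie's actual string type.

```rust
use std::convert::TryFrom;
use std::marker::PhantomData;

/// Pointer + length view of a Rust string. No bytes are copied; the
/// lifetime marker keeps the view from outliving the source string.
#[repr(C)]
struct RawStr<'a> {
    len: u32,
    ptr: *const u8,
    _borrow: PhantomData<&'a [u8]>,
}

/// Borrow `s` as a raw view. Returns None if the length does not fit in
/// the (assumed) 32-bit length field.
fn borrow_as_raw(s: &str) -> Option<RawStr<'_>> {
    let len = u32::try_from(s.len()).ok()?;
    Some(RawStr {
        len,
        ptr: s.as_ptr(),
        _borrow: PhantomData,
    })
}
```

Since Rust strings already carry their length and need no terminating NUL, nothing has to be copied or re-encoded just to cross the boundary.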
Q-03.05: <<<<----CLOSED, Fixed---->>>> What is the intended use-case for the "as" argument to `rosie_import`?
Is there a situation where a user may want to load a package under multiple names? That would make sense if it
were possible to extend packages and then you might wind up with the original package for compatibility and an
extended version that is modified for a specific purpose. But I'm unclear on how the "package extending"
functionality would work. Basically, I'm asking why the `pkgname` argument isn't always enough.
Jamie said:
When you import an RPL pattern library, you have the option of using it
under a different, custom name. You might do this if the package name is
long, to save typing a long name when you refer to an imported pattern, but
perhaps more importantly, in RPL the pattern name appears in the output.
(The pattern name ends up in the "type" field of a match.) The ability to
"import X as Y" lets the pattern writer ensure that "Y.foo" is a pattern
type in the output, not "X.foo", if that's what they want.
By the same reasoning (having control over the output), maybe the programmer
wants to have X.foo appear in some places and Y.foo in others, but we don't
support importing the same package multiple times under different names.
$ rosie --rpl 'import net as FOO' match -o jsonpp FOO.ip <<< "127.0.0.1"
{"type": "FOO.ip",
"data": "127.0.0.1",
"e": 10,
"subs":
[{"type": "FOO.ipv4",
"e": 10,
"data": "127.0.0.1",
"s": 1}],
"s": 1}
$
Luke said:
Oh, Ok. I misunderstood. I've fixed the documentation in the Rust crate.
Q-03.06: <<<<----CLOSED, Docs Changed---->>>> Along the same lines, why does `rosie_import` set `actual_pkgname` to
`pkg_name` in the success case, instead of to `as`? Or why is this arg even needed? I.e. is there a case
where librosie might create a brand new name, or might sometimes return the `pkgname` and other times return
`as` depending on some internal logic?
Jamie said:
I'm glad you asked this, because it involves a subtle issue that needs to be
documented. The `import X` declaration causes a search through the
configured list of directories (libpath) for a file X.rpl. That is, the
argument to `import` specifies the base part of a file name, which resolves
to a "location" in the file system.
We don't know what we'll find in the file X.rpl. For X.rpl to import
successfully, it must be valid RPL and have a package declaration. But what
if the package name inside the file does not match the file name? E.g.
-- file X.rpl:
package Y
test = "hi"
We allow this, mostly because file systems are weird, especially around
Unicode but also due to symbolic links, hard links, and other redirections.
So we didn't want to enforce some flavor of string equality between the file
name and the package declaration inside of it.
And this is why `rosie_import` returns an "actual package name", which is
the name that it found in the package declaration inside the file.
Importantly, the declared package name is the name used in RPL patterns.
For example, this should work (provided the file X.rpl is in /tmp, the
directory given as the libpath):
$ rosie --libpath /tmp --rpl 'import X; pat = Y.test' match -o jsonpp pat <<< "hi"
{"data": "hi",
"s": 1,
"type": "pat",
"e": 3}
$
In the example above, the pattern being matched is Y.test, not X.test.
It is certainly not a best practice to put an RPL package in a file that has
a different name. The primary use case for supporting it is to cope with
file system limitations (Unicode) and fancy features (links).
There's an additional use case, though it remains hypothetical in the sense
that I don't know if anyone is doing this. The `import` statement (and API)
can have as its argument a path (interpreted relative to a libpath
directory) and not just a base file name.
This feature allows us to organize pattern packages, perhaps by topic, but
also by language (human language, like English). Suppose instead of a
single date.rpl file, we created a `date` directory in one of our libpath
directories. Within the `date` directory, we could have several files,
e.g.
date/es.rpl // Names of days, months en Español
date/fr.rpl // Names of days, months en Français
date/en.rpl // Names of days, months in English
If all of these files contained the declaration `package date`, then we can
write a bunch of RPL patterns using date.xyz, where xyz is defined in all of
those files. We can `import date/es` to get the Spanish date patterns,
knowing that because es.rpl contains `package date`, we can use date.xyz in
our own patterns.
Of course, we could organize our files another way, such as by language
first, and then topic:
es/date.rpl // enero, febrero, lunes, martes, ...
fr/date.rpl // janvier, février, ...
en/date.rpl // January, February, ...
In this case, we don't have any need for the file name to be different from
the package name. (Unless we allow the Spanish team to call their file
es/fecha.rpl instead -- which is fine, as long as it has `package date`
inside.)
Luke said:
I've updated the documentation in the Rust crate, but did not include the complete information
from your response. If you incorporate the above into a Rosie document, I'll link to it from the
Rust crate docs.
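
A minimal sketch of the file-name versus package-name distinction from Jamie's answer to Q-03.06. It assumes a hypothetical helper, `declared_package`, that does a simplified line scan of RPL source text; it is not librosie's parser and ignores most of the RPL grammar. It only illustrates that the name used in patterns comes from the `package` declaration inside the file, not from the file name passed to `import`.

```rust
/// Return the name in the first `package NAME` declaration of an RPL source
/// text. Hypothetical helper for illustration only; NOT librosie's parser.
pub fn declared_package(rpl_source: &str) -> Option<&str> {
    for line in rpl_source.lines() {
        let line = line.trim();
        // Skip blank lines and RPL comments, which begin with `--`.
        if line.is_empty() || line.starts_with("--") {
            continue;
        }
        let mut words = line.split_whitespace();
        if words.next() == Some("package") {
            // The word after `package` is the declared package name.
            return words.next();
        }
    }
    None
}
```

Given the contents of the X.rpl file shown above, `declared_package` returns `Some("Y")`, which is why the pattern has to be written as `Y.test`.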
Q-03.07: <<<<----CLOSED, Fixed---->>>> How should I think about the `start` index for `rosie_match` & `rosie_trace`?
It seems to be 1-based. But what does passing 0 signify conceptually? Empirically, passing 0 just seems to mess
everything up. For example, it causes "rosie_match" not to match, while "rosie_trace" does match, but claims to
match one character more than the pattern really matched. If 0 has a conceptual meaning, I'd like to make sure
it's documented and tested. And if 0 is never valid, I will check for it as an invalid argument.
Jamie said:
Indeed, the influence of data science (and Lua) is apparent with the 1-based
indexing in Rosie. Rosie should check for 0 being passed in, and return an
error. Until we get that patched, I would follow your suggestion of
catching this in the Rust interface.
It's not clear that the choice of 1-based indexing and inclusive ranges
(where 1..3 includes characters 1, 2, and 3) is the best choice. Data
scientists seem fine with it, unless they program a lot. :-/
Dijkstra seems to have won this war, in the sense that almost every
programming language uses 0-based indexing and inclusive/exclusive ranges
(where 1..3 includes only the second and third characters, because char 1 is
the second char, and char 3 is the fourth char and not included in the
range). Rosie v2 (some day) may revisit this.
Luke said:
Ok. Passing 0 for start now returns RosieError::ArgError in Rust. For good measure, I also check the
upper bound (start <= input.len), because string length is always stored for Rust strings (unlike C
strings, which require an expensive scan for the NUL terminator).
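
A minimal sketch of the check described above. The `check_start` helper and the local `ArgError` type are hypothetical stand-ins (the crate's real error is `RosieError::ArgError`), and the bounds follow the `start <= input.len` rule stated in the reply. The helper also converts Rosie's 1-based position into the 0-based byte offset that Rust slicing expects.

```rust
/// Hypothetical stand-in for the crate's `RosieError::ArgError`.
#[derive(Debug, PartialEq)]
pub struct ArgError(pub &'static str);

/// Validate a 1-based `start` position against `input` and convert it
/// to a 0-based byte offset.
pub fn check_start(start: usize, input: &str) -> Result<usize, ArgError> {
    if start == 0 {
        // Rosie positions are 1-based, so 0 never names a valid position.
        return Err(ArgError("start is 1-based; 0 is not valid"));
    }
    if start > input.len() {
        // `str::len` is O(1) in Rust, so the upper-bound check is cheap.
        return Err(ArgError("start is past the end of the input"));
    }
    Ok(start - 1)
}
```

With this rule, `check_start(1, "hi")` yields offset 0, while both `check_start(0, "hi")` and `check_start(3, "hi")` are rejected.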
Q-03.08: <<<<----CLOSED, Nothing to do---->>>> How should the `abend` field of the match result data be exposed to
the client? If its meaning is encoder-specific, is there any documentation I can reference?
Jamie said:
It's not encoder-specific, but it is a conundrum. There's a Halt
instruction in the Rosie bytecode, and it is used by the RPL `error`
function to halt the matching while preserving everything that has been
matched thus far. I have found uses for this when writing parsers in RPL,
because sometimes I want to signal a "syntax error" in the input and stop
the matching. If matching stops via Halt (`error`), the abend flag is set,
but the match data looks completely normal.
We could argue that the abend return value is not needed, because using
`error` in an RPL pattern causes a node to be added to the parse tree, and
it has the type `error`. So the program that consumes the Rosie output will
know that the match was abnormally ended. The abend return value is, then,
just a convenience.
FYI: The RPL `message` function also inserts a node into the parse tree, but
unlike `error` it does not halt the matching. And, I just wrote some
examples using Rosie 1.2.2 that show some brokenness with both `message` and
`error`. They take a string argument which should appear in the data field
of the output, and that string is not appearing. I'll patch and add tests
for this.
Luke said:
Ok. For now, the field will remain inaccessible to the API users.
Q-03.09: <<<<----OPEN, follow-up questions---->>>> What kind of things are rc files used for? Is there an example or
documentation? I'm working on the assumption that I can skip this functionality for the Rust crate because we
probably don't want user-specified configuration overriding the behavior the app developer intended when they
incorporated Rosie as a component inside their program.
Jamie said:
Agree that you can skip this. The arcanely named "run control" file
predates Unix, I think. The Rosie CLI (by default) reads ~/.rosierc if it
exists, and will configure some settings based on what it finds. The file
format is defined in rpl/rosie/rcfile.rpl, and it's a subset of the RPL
syntax. My usual .rosierc file looks like this:
libpath = "/usr/local/lib/rosie/rpl"
libpath = "/Users/jennings/Projects/community/lang"
libpath = "/Users/jennings/Projects/community/rawdata"
-- Changed net.path to green for demos:
colors = "*=default;bold:net.*=red:net.ipv6=red;underline:net.url_common=red;bold:net.path=green:net.MAC=underline;green:num.*=underline:word.*=yellow:all.identifier=cyan:id.*=bold;cyan:os.path=green:date.*=blue:time.*=1;34:ts.*=underline;blue:num.*=red;underline"
colors="destructure.find.<search>=red:destructure.alpha=blue:destructure.num=cyan"
There are two uncommon design decisions in evidence here, and even now after
5 years I wonder what is the best approach.
(1) You can add more components to a list-based configuration item like
libpath or colors by adding another "assignment" statement. Probably the
syntax should have used "+=" and not "=" because that's what they do. The
benefit is that it's easy to add something new and then take it out --
because it's on a line by itself, you don't have to edit a long list.
(2) If you configure a setting in the rcfile (or on the command line, or
through the API), we throw away any default value for that setting. The
rationale can be seen using libpath: If you set libpath, it is your choice
as to whether or not to include the path to the standard library, and if you
include it, where in the libpath it should go. The downside to this is that
you have to list the standard library as soon as you customize libpath --
which means you have to know where it is. (The `rosie config` command and
API can tell you this information and more, which helps in this regard.)
Luke said:
Your answers piqued a few more questions:
A: Does the `rosie_libpath` function append additional paths, in a similar way to rcfile assignment statements?
If so, how can I clear out old paths? If not, I assume I can set multiple paths using one call to
rosie_libpath, so what delimiter / escape sequence should I use between filesystem paths?
B: Later on, (Q-04.02), I ask about how to configure colors. So now my question is: can this be done
through the api without an rc file? Unlike rosie_libpath which is two-way, it looked to me like
rosie_config was only able to get config values but not set them. Did I miss something?
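
A minimal sketch of the two rcfile behaviors Jamie describes under Q-03.09: (1) repeated assignments to the same key accumulate, and (2) setting a key at all discards its default. The `parse_rc` and `effective` functions and the `Settings` type are hypothetical names for illustration. The parsing is a simplified `key = "value"` line scan, not the real format defined in rpl/rosie/rcfile.rpl.

```rust
use std::collections::HashMap;

/// Settings read from rcfile-style text: each key maps to the list of
/// values assigned to it, in file order.
pub type Settings = HashMap<String, Vec<String>>;

/// Parse lines of the form `key = "value"`, skipping blank lines and
/// `--` comments. Simplified for illustration.
pub fn parse_rc(text: &str) -> Settings {
    let mut settings = Settings::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with("--") {
            continue;
        }
        // Split at the first `=` only, so values may contain `=` themselves.
        if let Some((key, value)) = line.split_once('=') {
            let value = value.trim().trim_matches('"');
            // Behavior (1): a repeated assignment appends, it does not overwrite.
            settings
                .entry(key.trim().to_string())
                .or_default()
                .push(value.to_string());
        }
    }
    settings
}

/// Behavior (2): if the rcfile sets `key` at all, the defaults are discarded.
pub fn effective(settings: &Settings, key: &str, defaults: &[&str]) -> Vec<String> {
    match settings.get(key) {
        Some(values) => values.clone(),
        None => defaults.iter().map(|s| s.to_string()).collect(),
    }
}
```

Under these assumptions, two `libpath` lines produce a two-element list and the default library path is dropped unless it is listed explicitly, while a key that never appears in the file keeps its defaults.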
Q-03.10: <<<<----OPEN---->>>> The `trace` output appears to be substantially less useful when the `find:` and `findall:`
pattern prefixes are used. Is this by design or a bug?
Consider the output from this:
let mut trace = RosieMessage::empty();
let pat = Rosie::compile("find:date.any").unwrap();
pat.trace(1, "Of course! Nov 5, 1955! That was the day", TraceFormat::Full, &mut trace).unwrap();
println!("Trace = {}", trace.as_str());
vs. this:
let mut trace = RosieMessage::empty();
let pat = Rosie::compile("date.any").unwrap();
pat.trace(1, "Nov 5, 1955! That was the day", TraceFormat::Full, &mut trace).unwrap();
println!("Trace = {}", trace.as_str());
Q-03.11: <<<<----CLOSED, Fixed---->>>> The Rosie CLI loads the dependencies of an expression prior to compiling it, and I wanted to
offer this convenience also, so I implemented `RosieEngine::import_expression_deps` to be used by the higher-level calls.
In the CLI, it appears that the code to do this is driven primarily from Lua. Currently, my Rust code calls
`rosie_expression_deps` which then jumps into Lua, ends up parsing the expression and evaluating the dependencies,
putting that info into a Lua table, encoding that table as JSON, passing it back to Rust, and then I parse the JSON,
and finally call `import_pkg` on each result. What I'm getting at is: it seems like it would be better if librosie
exposed a `rosie_syntax_op` to just call the same Lua routine as the CLI.
Luke said (2021-10-27):
I added `rosie_import_expression_deps()` to librosie, which calls into the Lua function, `import_expression_deps` in
engine_module.lua. That function is mostly code cribbed straight out of the local function `import_dependencies` in
cli-common.lua.
Q-03.12: <<<<----OPEN---->>>> Package Namespace Path inconsistencies. Rosie seems to assign a different namespace path to
packages depending on how they are loaded. Is the following discrepancy expected?
```
let mut engine1 = engine::RosieEngine::new(None).unwrap();
let mut engine2 = engine::RosieEngine::new(None).unwrap();
engine1.import_pkg("date", None, None).unwrap();
engine2.load_pkg_from_file(engine2.lib_paths().unwrap()[0].join("date.rpl"), None).unwrap();
let date_pat1 = engine1.compile("date.us_long", None).unwrap();
let date_pat2 = engine2.compile("date.us_long", None).unwrap();
println!("Imported = {}", date_pat1.match_str::<MatchResult>("Saturday, Nov 5, 1955").unwrap().pat_name_str());
println!("Loaded = {}", date_pat2.match_str::<MatchResult>("Saturday, Nov 5, 1955").unwrap().pat_name_str());
```
This all doesn't quite make sense to me in light of the explanation in Q-03.06. So it's either a bug or more
documentation is needed.
Section 4, Rust-level API Aesthetics & Documentation Questions
Q-04.01: <<<<----CLOSED, Fixed---->>>> How should I describe the match_result.ttotal and match_result.tmatch in the documentation? I see that they are
timing counters, but what operations, precisely, do they measure?
Luke said:
Docs Jamie referenced had the answer to this question.
Added accessors: RawMatchResult::time_elapsed_matching() and RawMatchResult::time_elapsed_total().
Q-04.02: <<<<----OPEN---->>>> Where is the documentation for the `color` encoder, and specifically how to customize the colors associated with
each sub-expression? I'd like to link to it from the Rust documentation.
Q-04.03: <<<<----OPEN---->>>> Where is the documentation for implementing a custom encoder in Lua? I'd like to link to it. But I'd also like
to read it myself.
Q-04.04: <<<<----CLOSED, Fixed---->>>> I'm starting to feel that I should rethink the lifecycle management of PatternID objects in the Rust interface.
In particular, would it be better to automatically free them when they go out of scope rather than giving the
user the API call to do it manually?
LP IMPLEMENTATION NOTE: Implementing the `Drop` trait on a PatternID means the PatternID needs to have a reference
to its engine, which isn't possible to do directly because we still need calls that have mutable (and therefore
exclusive) access to the engine. We could implement a back-door to keep this access, but it would come
with an additional runtime validity check each time the pattern is accessed, to make sure the engine is still
valid.
Also, I still want the patternIDs to be clonable, so I'd also have to make them capable of ref-counting.
Possibly a small can-of-worms, but perhaps worth it because it simplifies the UI quite a lot by not requiring the
client to worry about freeing compiled patterns they are no longer using.
Luke said (2021-10-21):
Jamie added pattern-specific output buffers, accessible through the rosie_match2 call.
I have created a Pattern Rust struct, which subsumes the former PatternID (which was removed). The Pattern
implements the Drop trait, and therefore frees the patterns.
The Pattern struct also hosts the match calls, and therefore ensures the buffers aren't improperly referenced
by multiple RawMatchResult structs.
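
A minimal sketch of the `Drop`-based lifecycle discussed in Q-04.04. It is not the crate's implementation: `Engine`, `EngineState`, and `Pattern` here are mock types with no FFI, and the shared engine state uses `Rc<RefCell<...>>` purely as one assumed way to give each pattern a handle back to its engine.

```rust
use std::cell::RefCell;
use std::collections::HashSet;
use std::rc::Rc;

/// Mock engine state: the set of compiled-pattern handles still allocated.
/// Stands in for librosie's engine; no FFI is involved in this sketch.
#[derive(Default)]
pub struct EngineState {
    next_id: i32,
    live: HashSet<i32>,
}

/// A cheaply clonable handle to shared engine state.
#[derive(Clone, Default)]
pub struct Engine(Rc<RefCell<EngineState>>);

/// Owns one compiled pattern and frees it when dropped.
pub struct Pattern {
    id: i32,
    engine: Engine,
}

impl Engine {
    /// "Compile" an expression: allocate a handle and record it as live.
    pub fn compile(&self, _expression: &str) -> Pattern {
        let mut state = self.0.borrow_mut();
        state.next_id += 1;
        let id = state.next_id;
        state.live.insert(id);
        Pattern { id, engine: self.clone() }
    }

    /// Number of compiled patterns that have not been freed yet.
    pub fn live_patterns(&self) -> usize {
        self.0.borrow().live.len()
    }
}

impl Drop for Pattern {
    fn drop(&mut self) {
        // A real implementation would release the librosie pattern here.
        self.engine.0.borrow_mut().live.remove(&self.id);
    }
}
```

The client never calls a free function: when a `Pattern` goes out of scope, `live_patterns()` on its engine drops by one. Making `Pattern` clonable would additionally need reference counting on the handle itself, which this sketch leaves out.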
Q-04.05: <<<<----CLOSED, Fixed---->>>> If we go in the direction above, I'd also consider changing match and trace to
be methods of the Pattern, rather than methods of the Engine. So basically, from the client's perspective, the engine
creates patterns, and the patterns are what are used to match and trace.
Unfortunately, the fact that the match buffer is owned by the engine might complicate things from the user's
perspective. If librosie could give us a separate buffer per pattern, that would make this cleaner. Otherwise, I'd
say this change would make the API worse, not better. Thoughts?
Luke said (2021-10-21):
Done, exactly as described. See comment on Q-04.04.
Q-04.06: <<<<----CLOSED, Fixed---->>>> Does it make any sense to put a pattern-cache in front of "compile", so the same
pattern isn't compiled multiple times? Basically checking the string against strings that have already been compiled.
This might pave the way towards a high-level compile + match call that could be called in a loop without horrible
performance.
Luke said (2021-10-21):
The singleton engine supports the Rosie::match_str() method, which is a one-line compile + match call which caches
compiled patterns. If the user explicitly calls `compile` themselves, they probably have a reason for it (Like
wanting a separate pattern with its own results buffer) and therefore they should get a fresh compile.
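
A minimal sketch of a pattern cache in front of `compile`, as discussed in Q-04.06. `PatternCache`, `Compiled`, and `get_or_compile` are hypothetical names, and `Compiled` is a placeholder rather than a real librosie handle; the crate's actual caching behind `Rosie::match_str()` may be organized differently.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Placeholder for a compiled pattern; a real one would wrap a librosie handle.
#[derive(Debug)]
pub struct Compiled {
    pub expression: String,
}

/// Caches compiled patterns, keyed by their expression string.
#[derive(Default)]
pub struct PatternCache {
    compiled: HashMap<String, Rc<Compiled>>,
    /// How many times the (expensive) compile step actually ran.
    pub compile_count: usize,
}

impl PatternCache {
    /// Return the cached pattern for `expression`, compiling it on first use.
    pub fn get_or_compile(&mut self, expression: &str) -> Rc<Compiled> {
        if let Some(hit) = self.compiled.get(expression) {
            return Rc::clone(hit);
        }
        // Cache miss: this is where the real compile call would go.
        self.compile_count += 1;
        let pattern = Rc::new(Compiled { expression: expression.to_string() });
        self.compiled.insert(expression.to_string(), Rc::clone(&pattern));
        pattern
    }
}
```

Calling `get_or_compile("date.any")` in a loop compiles once and then returns the same shared pattern, which is what makes a one-line compile + match call affordable.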
Section 5, High-Level Interface Discussion
This section outlines a few places where, after using Rosie for the past 3 weeks or so, I wished I didn't have
to type so much. In addition, I've tried to recruit a few friends to use Rosie as well, and
this captures some of their feedback about features they felt were missing or could be streamlined. Of course these are
just opinions, and opinions of people who aren't as knowledgeable about the subject as you are. So please take them for
what they are - possibly misguided ramblings of novices.
That said, some of these ideas don't involve any changes to the librosie core, and can be nicely layered on top of the
API as it already exists. Others might need a small interface added, while some involve pushing features into rpl itself.
Finally, it's entirely possible that the capability to do some of these things already exists, and I just haven't fully
appreciated the flexibility of the interface as it is currently designed. Please point out if this is the case.
Q-05.01: <<<<----OPEN---->>>> Match-Result-paths. Basically, I'm essentially imagining a convenience layer to access sub-matches for a
pattern. The goal would be to provide a one-line call to extract the string matched by any nested sub-expression.
For example, if `date.any` matched some input, I might be able to extract the year using something like:
`let year = match_result.extract_sub("any.slashed.year");`.
This could be implemented easily using an existing standard like JsonPath on top of the existing JSON match results,
but there may be an opportunity to do something cleaner, more powerful, or better-fitted to Rosie.
Luke said (2021-10-21):
I found this to be particularly needed when using the `find:` and `findall:` pattern prefixes. In that case,
retrieving the meaningful part of the matched substring requires descending a MatchResult tree.
Q-05.02: <<<<----OPEN---->>>> Wildcard Result-Paths. You don't have to go very far in the above direction before realizing that naive
paths are not terribly useful unless you know exactly which sub-expressions are going to match. And if you knew that,
you probably wouldn't need the top-level expression at all. So ideally there would be a way to get the year from a
`date.any` without knowing what format the input string was in. Something along the lines of: "any.*.year"
Unfortunately this introduces ambiguity in the case where the same sub-pattern occurs in multiple places, as caused
by a '*' in the original pattern. I honestly don't have a good way to reconcile this but I think people will tolerate
some sharp edges if it lets them write one line of code instead of writing what previously took 5 lines.
Q-05.03: <<<<----OPEN---->>>> Recursive Wildcards. In the case of the `date.any` pattern, we know the year is always at the third level.
However, in some deeper patterns, we may not be sure precisely where the sub-expression we want will live. So I'm
imagining a token that can find sub-expressions by name, along the lines of: "any.**.year", where the year
sub-expression would be found regardless of where it is nested.
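
A minimal sketch of the result-path idea from Q-05.01 through Q-05.03, assuming a simplified `Node` tree rather than the crate's `MatchResult`, and using the short node names from the examples above ("any", "slashed", "year"). The `extract_sub` function and its path syntax are hypothetical. Because of the ambiguity noted in Q-05.02, it returns every match instead of a single value.

```rust
/// Simplified match-tree node: a pattern name, the matched text, and
/// sub-matches. A stand-in for illustration, not the crate's `MatchResult`.
pub struct Node {
    pub name: &'static str,
    pub data: &'static str,
    pub subs: Vec<Node>,
}

/// Return the text of every node selected by a dot-separated `path`.
/// Segments: a plain name matches one level, `*` matches any single level,
/// and `**` matches zero or more levels.
pub fn extract_sub(root: &Node, path: &str) -> Vec<&'static str> {
    let segments: Vec<&str> = path.split('.').collect();
    let mut out = Vec::new();
    collect(root, &segments, &mut out);
    out
}

fn collect(node: &Node, path: &[&str], out: &mut Vec<&'static str>) {
    let (first, rest) = match path.split_first() {
        Some(parts) => parts,
        None => return,
    };
    if *first == "**" {
        // `**` may consume zero levels: try the rest of the path here...
        collect(node, rest, out);
        // ...or one more level: keep `**` and descend into each child.
        for sub in &node.subs {
            collect(sub, path, out);
        }
    } else if *first == "*" || *first == node.name {
        if rest.is_empty() {
            // The whole path has been consumed, so this node is selected.
            out.push(node.data);
        } else {
            for sub in &node.subs {
                collect(sub, rest, out);
            }
        }
    }
}
```

For a tree shaped like any -> slashed -> (month, day, year), the paths "any.slashed.year", "any.*.year", and "any.**.year" all select the year node. A trailing `**` selects nothing in this sketch, which is a simplification.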
Q-05.04: <<<<----OPEN---->>>> Choice-Results. Sometimes the year is matched by the `year` sub-expression, but in other formats it is matched
by the `short_long_year`, and both roll up into `date.any`. If we wanted to specify we wanted the "conceptual year",
we would need to say: "any.**.[year | short_long_year]" (BTW, I'm sure my syntax choices are terrible, I'm just
making stuff up to express a concept.) Pretty quickly, it's becoming clear that we might need a lot of the power of
Rosie to succinctly extract results from Rosie. I don't know if that's a good thing or a bad thing.
Q-05.05: <<<<----OPEN---->>>> Pattern-Specific Encodings. Consider the `date` package. The conceptual data of `month`
may be represented by any of `month` (which is numeric), but also `month_shortname`, `month_longname`, or `month_name`,
which are all various alpha strings. It would be super-cool to be able to declare some kind of a unifying-expression
that could map "1", "January", and "Jan" back to the number 1. It solves the "Choice" problem above, and allows the
standard pattern library to export a normalized interface for the matched data as well. So, I could extract
"numeric_month" from the match results, and get "1", regardless of whether the input string said "Jan", "January", "1",
or "01".
I know this is a conceptual break from the match results as they currently are, however, because now the match results
from these special "encoding patterns" don't exist as subsets of the input string. So I'm not sure what that does
for the rest of the design, if it throws everything into limbo. But it would be a useful feature, and it could be
implemented in a layer on top of the core matching engine, if it's deeply incompatible with the rest of Rosie.
UPDATE: I see from looking in the Python 'byte' decoder code, that "constant capture" patterns are already a thing,
so perhaps this won't be as fundamental as I had feared.
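
A minimal sketch of the normalization described in Q-05.05, implemented as a layer on top of the matched text rather than inside Rosie. The `numeric_month` function is hypothetical, handles English names only, and treats any three-letter prefix of a month name as its abbreviation, which is an assumption about what `month_shortname` accepts.

```rust
/// Map a month written as a number ("1", "01"), a full English name
/// ("January"), or a three-letter abbreviation ("Jan") to 1..=12.
pub fn numeric_month(text: &str) -> Option<u32> {
    const NAMES: [&str; 12] = [
        "january", "february", "march", "april", "may", "june",
        "july", "august", "september", "october", "november", "december",
    ];
    let text = text.trim();
    // Numeric form: accept only values in the valid month range.
    if let Ok(n) = text.parse::<u32>() {
        return if (1..=12).contains(&n) { Some(n) } else { None };
    }
    // Name form: compare case-insensitively against full names and
    // three-letter prefixes.
    let lower = text.to_ascii_lowercase();
    NAMES
        .iter()
        .position(|name| *name == lower || (lower.len() == 3 && name.starts_with(lower.as_str())))
        .map(|index| index as u32 + 1)
}
```

With this helper, "Jan", "January", "1", and "01" all normalize to 1, regardless of which sub-expression produced the match.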
Q-05.06: <<<<----OPEN---->>>> Inline Anonymous Sub-Expressions. I was talking to a friend of mine about Rosie, and he said "I'll try
out Rosie when it can do this in one line: `let [x, digits, word2] = target.match(/^([\.0-9]+)-(\w+)$/);`" (He's
a JavaScript programmer.) Anyway, it would be easy enough to layer together a high-level compile+match call, but
the part about defining what sub-expressions end up in which variable is something I don't know how to do with
Rosie unless the sub-expressions are named. Do you think a syntax for inline-declared sub-expressions within the
same single-line pattern makes any sense for Rosie? Or is it too far from Rosie's intended design philosophy?
Q-05.07: <<<<----PUNTED, Depends on above features---->>>> Search & Replace. The "shape" of a search & replace function might depend on the answers to the above 6 points,
but S&R is a super-useful capability, whatever form it takes. This discussion can be postponed until later, as
the discussion has many dependencies on the match-access capabilities, and the rest is essentially just down to
creating an efficient implementation.
Q-05.08: <<<<----PUNTED, Depends on above features---->>>> Meta-Match-State: This idea is way out there, but I figured I'd throw it out. Consider `date.any` again.
date.any is composed of 6 different date patterns in a "Choice List" (Choice List is what I'm calling a list
separated by '/'). I understand that Rosie iterates through choice lists linearly until it finds the first list
element that matches. However, what if we had an alternate form of ChoiceList where each element was given
conceptually equal rank? Basically externalizing the logic to select which choice to match in a choice list
with multiple matches.
Back to date.any as an example. I know this example is flawed because there is no eur_dashed format, that would
be analogous to the us_dashed format, but imagine there were. Now consider that modified date.any matching this
sequence of values:
"26-04-2017", "08-04-2017", etc.
In the example, the first item is unambiguously `eur_dashed`, because 26 is outside the range for month. However,
the second item could be matched by either pattern, `eur_dashed` or `us_dashed`. Because `us_dashed` is first in
the `date.any` choice list, that's the pattern that will match the second element.
But what if we could perform the match as a two-pass operation, where the first pass determines the pattern choice
preferences to find a set of choices that work for all data elements, and the second pass then applies those
choices?
Admittedly, I haven't fully explored the implications of this, and there may be some hairball cases. But the
idea is simply to allow some data elements to be useful in resolving ambiguity in other data elements from the
same data set. As if the match were creating a single mapping for the whole data-set and not an individual mapping
for each data element from the set.
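
A minimal sketch of the two-pass idea in Q-05.08, done outside Rosie with plain string parsing. The `Format` enum, `fits`, and `choose_format` are hypothetical, `eur_dashed` is the imagined format from the example above, and the range check ignores days-per-month for brevity. The first pass picks the first format, in choice-list order, that fits every element; a second pass would then match each element with that one format.

```rust
/// The two interpretations of a dashed date such as "08-04-2017".
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Format {
    UsDashed,  // month-day-year
    EurDashed, // day-month-year (hypothetical, as in the example above)
}

/// Does `text` look like a plausible date under `format`?
/// Only field count and field ranges are checked.
pub fn fits(text: &str, format: Format) -> bool {
    let parsed: Option<Vec<u32>> = text.split('-').map(|f| f.parse().ok()).collect();
    let fields = match parsed {
        Some(fields) if fields.len() == 3 => fields,
        _ => return false,
    };
    let (month, day) = match format {
        Format::UsDashed => (fields[0], fields[1]),
        Format::EurDashed => (fields[1], fields[0]),
    };
    (1..=12).contains(&month) && (1..=31).contains(&day)
}

/// Pass 1: pick the first format, in choice-list order, that fits every
/// element of the data set. Returns None if no single format fits them all.
pub fn choose_format(data: &[&str]) -> Option<Format> {
    [Format::UsDashed, Format::EurDashed]
        .iter()
        .copied()
        .find(|&format| data.iter().all(|text| fits(text, format)))
}
```

For the sequence "26-04-2017", "08-04-2017", the first element rules out `UsDashed`, so the whole set resolves to `EurDashed`. Taken alone, "08-04-2017" still resolves to `UsDashed`, matching today's first-choice-wins behavior.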
Q-05.09: <<<<----CLOSED, Nothing to do---->>>> Disposable RosieEngines: You (Jamie) made a comment earlier (in Q-01.01)
about the fact that additional engines are "really cheap" to initialize once the rosie core has been bootstrapped.
Are they so cheap that a high-level API could create a brand-new engine for each compiled pattern? Would there be
any other downside to this approach?
It seems like it could simplify the API from the user's perspective.
Luke said (2021-10-21):
In the Slack chat, Jamie pointed out that you could have an engine for each pattern, but it would be inefficient
because the complete set of dependencies would need to be reloaded for each engine. Anyway, it's a moot point
because the per-pattern results buffers allow for singleton engines, and thus the API simplifications have
already been implemented.
Q-05.10: <<<<----OPEN---->>>> Callback to support fuzzy matching. As we discussed over Slack, it would not be practical
to perform "fuzzy matching" (i.e. match strings that deviate from a set of strings by a maximum distance according
to a distance function) using Rosie's current feature set. Doing this with FSAs alone would require a combinatoric
pattern that grows with the number of possible strings in the set it's trying to match.
In addition, the precise functionality each use case requires and the desired computational cost tradeoffs would
make a canonical rosie extension very difficult to design.
Therefore, we concluded at the time that the best path forward may be to allow special patterns that are implemented
in native code. A rough outline of the desiderata for a "callback" or "native pattern" feature would be:
- The ability to register a native pattern, along the lines of rosie_load(), giving the native patterns
symbol name(s) that make them accessible to other patterns.
- The native pattern implementation would likely be a C (native) function that could receive an input buffer
ptr, a start offset, and possibly other state to facilitate more advanced features. It would return a bool
indicating whether the native code identified a pattern, and an end offset to specify the end of the native
pattern in the input.
- If an enhancement for Q-05.05 is added, the callback should be able to output a "value" string
- The ability to use native patterns as sub-patterns within larger traditional rpl patterns is a requirement, IMO.
- The ability to dispatch sub-pattern matching within a native pattern implementation is very desirable. Perhaps
using rosie_match or something similar, although perhaps we'd need a different call, e.g. rosie_match_sub(),
to maintain internal state continuity within the matching engine.
There are many details yet to be worked out.
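
A rough sketch of what the first two desiderata in Q-05.10 could look like from the Rust side. Everything here is hypothetical: `NativePattern`, `NativeRegistry`, and `fuzzy_rosie` are invented for illustration, nothing like them exists in librosie today, and offsets are 0-based with an exclusive end, unlike Rosie's 1-based positions. The example native pattern accepts the word "rosie" with at most one substituted byte, as a stand-in for a real distance function.

```rust
use std::collections::HashMap;

/// Hypothetical native-pattern signature: given the input and a 0-based
/// start offset, return the exclusive end offset of the match, or None.
pub type NativePattern = fn(input: &[u8], start: usize) -> Option<usize>;

/// Registry mapping symbol names to native implementations.
#[derive(Default)]
pub struct NativeRegistry {
    patterns: HashMap<String, NativePattern>,
}

impl NativeRegistry {
    /// Make `pattern` available to other patterns under `name`.
    pub fn register(&mut self, name: &str, pattern: NativePattern) {
        self.patterns.insert(name.to_string(), pattern);
    }

    /// Run the native pattern registered as `name`, if there is one.
    pub fn run(&self, name: &str, input: &[u8], start: usize) -> Option<usize> {
        let pattern = *self.patterns.get(name)?;
        pattern(input, start)
    }
}

/// Example native pattern: match "rosie", allowing one substituted byte.
pub fn fuzzy_rosie(input: &[u8], start: usize) -> Option<usize> {
    const TARGET: &[u8] = b"rosie";
    let end = start.checked_add(TARGET.len())?;
    // `get` returns None when the input is too short for a candidate.
    let candidate = input.get(start..end)?;
    let mismatches = candidate.iter().zip(TARGET).filter(|(a, b)| a != b).count();
    if mismatches <= 1 { Some(end) } else { None }
}
```

After `register("fuzzy.rosie", fuzzy_rosie)`, running the pattern at offset 7 of "I like rozie" reports a match ending at offset 12, while "I like roses" is rejected because two bytes differ. Dispatching sub-patterns from inside a native pattern, and emitting a "value" string, are not covered here.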
Section 6, Misc
Q-06.01: <<<<----OPEN---->>>> Do you know anybody who might be interested in testing out / fixing the Rust crate on Windows? I have been
developing on Mac OS & Linux, and can confirm both work as expected, but I don't have access to a Windows development
machine.
Q-06.02: <<<<----OPEN---->>>> This is not related to Rust, but rather a question about the philosophy of the standard pattern library. Does
the standard pattern library exist within a narrow purview to match formats as they are precisely specified,
i.e. defined patterns, for example RFC 2822 for date formatting?
Or does the standard pattern library have room for patterns that are "The kind of thing a person might type when
attempting to represent a certain kind of value." i.e. inherently subjective patterns.
I wrestled with this question when I wrote the currency.rpl package. And it seems like a philosophical judgement call,
balancing convenience against potential ambiguity.
For example, it would be nice if "date.any" could successfully match: "Sat., Nov. 5, 1955", or if "time.any" would
match "3:20am GMT" but then where to draw the line?
Jamie said (2021-10-26, Luke paraphrasing verbal conversation recalled from memory):
- There are two separate use cases, one for validating input against rigid standards, and the other for matching "anything
that looks like a X", e.g. looks like a date, or looks like a time.
- The current Standard Pattern Library is targeted at the first, but there is a need for patterns for the second use case.
- Jamie will consider the appropriate pattern naming and rpl file organization.
Q-06.03: <<<<----OPEN---->>>> The https://rosie-lang.org/ website would really benefit from having the RPL reference linked directly from the
sidebar, and having some simple "getting started" examples on the "examples" landing page, rather than links to find
the examples elsewhere. I think this thread summarizes many people's unfortunate first impressions when approaching
Rosie: "https://news.ycombinator.com/item?id=21145755". On the upside, it would hopefully be an easy thing to fix these
minor marketing / communication problems.
Q-06.04: <<<<----OPEN---->>>> The Rosie logo, hosted at https://rosie-lang.org/images/rosie-circle-blog.png sits within a frame of transparent
border pixels. This is apparently a good style choice for the Rosie website where the logo is displayed, however,
the Rust documentation anticipates a logo that fills the entire image.
Would it be possible to upload a square image of the Rosie logo that fills the whole frame (maintaining the alpha
mask for the corners), to be hosted on https://rosie-lang.org? I don't think it's picky about resolution, so
200x200px is fine, but so is another resolution.