
GSoC 2019 Progress Updates #40

Open
sushain97 opened this issue May 6, 2019 · 275 comments
@sushain97
Member

This is an issue so that notifications come through nicely.

@sushain97 sushain97 changed the title GSoC Progress Updates GSoC 2019 Progress Updates May 6, 2019
@singh-lokendra
Collaborator

singh-lokendra commented May 9, 2019

Black: The Uncompromising Code Formatter

  • max-line-length is set to 160 in .flake8, whereas black defaults to 88. Should I modify it to 160?
  • black uses double quotes, so it'll replace single quotes
  • Regarding its integration with the project, a few changes would be required:
    • use pre-commit to add hooks, so that black runs automatically every time a commit is made. Every new developer would need to run pre-commit install to install the hooks.
    • add .pre-commit-config.yaml. Should I use python3.6 for language_version?
    • black --check project_path/ can be used to check the code during CI tests. If changes are required, the exit code will be 1 and the build will fail.
    • Sample output of a commit:

$ git commit -m 'test_commit'
[WARNING] Unstaged files detected.
[INFO] Stashing unstaged files to C:\Users\admin/.cache\pre-commit\patch1557367025.
black....................................................................Passed
Flake8...................................................................Passed
[INFO] Restored changes from C:\Users\admin/.cache\pre-commit\patch1557367025.
[master c2f7890] test_commit
1 file changed, 10 insertions(+), 14 deletions(-)
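For reference, the pre-commit configuration under discussion might look something like this (a sketch; the hook repository URLs and `rev` values are illustrative, not pinned versions from this project):

```yaml
# .pre-commit-config.yaml (sketch)
repos:
  - repo: https://github.com/psf/black
    rev: 19.3b0               # illustrative revision
    hooks:
      - id: black
        language_version: python3.6
  - repo: https://gitlab.com/pycqa/flake8
    rev: 3.7.7                # illustrative revision
    hooks:
      - id: flake8
```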

  1. Should I go ahead and integrate Black with apertium-python? A badge is also available for the README file.
  2. Currently there are a lot of branches in the project. Should I commit changes to the master branch each time, or use some other branch for committing the code?

@sushain97
Member Author

Personally, I think 88 columns is too short. However, I really don't care that much. Also don't care about single/double quotes (despite having my own preference for the former). Don't worry about setting up commit hooks for now.

Should I go ahead and integrate Black with apertium-python? A badge is also available for the README file.

As far as checking goes, you can send a PR to switch the flake8 checks to black. This also requires changing the requirements-dev.txt (we should really just switch to pipenv).

Currently there are a lot of branches in the project. Should I commit changes to the master branch each time, or use some other branch for committing the code?

Use your own branches and send PRs to master that I'll review.

We really shouldn't spend any more time on this than the one day that we've already spent. The existing setup is perfectly fine so please don't spend more than an hour or two on changing style checking stuff tomorrow. I'd much rather you get started on the real work and we worry about this later.

@singh-lokendra
Collaborator

singh-lokendra commented May 10, 2019

  • I've put the black code formatter on hold for now. I'll use it locally and ensure that the linter checks still pass.
  • I have started working on apertium.analyzers. It calls eng-morph.mode, which in turn calls lt-proc.exe, so I am looking at its source. I am assuming this is a good starting point.
  • lt-proc depends on a lot of files. Just the source of fst_processor gave me goosebumps :D
  • I've read the SWIG Python documentation to get a basic idea of handling various cases.
  • Now understanding the flow of control before starting work on the SWIG wrapper.
  • Searching the logs directed me to the SWIG interface file for HFST. This might be helpful.

A few queries:

  • I am confused about the approach to take for the SWIG wrapper. Should I start with lt-proc and make wrappers for all the files it includes, or sequentially write wrappers for every file in lttoolbox?
  • Regarding the compilation of the wrappers, which approach should be used?
    • Using distutils: IMO this should be used, as it can handle various cases.
    • Dynamic linking: each file needs to be compiled separately. This would require some sort of makefile, and the specified naming convention has to be followed, else there is an import error when importing the Python library.
    • Static linking: not even SWIG suggests using this one.

@sushain97
Member Author

  • I've put the black code formatter on hold for now. I'll use it locally and ensure that the linter checks still pass.

SGTM.

  • I have started working on apertium.analyzers. It calls eng-morph.mode, which in turn calls lt-proc.exe, so I am looking at its source. I am assuming this is a good starting point.

Exactly where I would start.

  • lt-proc depends on a lot of files. Just the source of fst_processor gave me goosebumps :D

:D I think to start we'll only need to make the edges of the interface usable in Python? All of the FST processor itself doesn't need to be visible from Python.

I am confused about the approach to take for the SWIG wrapper. Should I start with lt-proc and make wrappers for all the files it includes, or sequentially write wrappers for every file in lttoolbox?

My perspective is that we should start with wrapping as little as possible in order to get e.g. eng-morph.mode working without subprocesses. That definitely shouldn't require wrapping every single file in lttoolbox. I'm not particularly familiar with SWIG but I think this should be possible? @unhammer / @TinoDidriksen could confirm?

  • Using distutils: IMO this should be used, as it can handle various cases.
  • Dynamic linking: each file needs to be compiled separately. This would require some sort of makefile, and the specified naming convention has to be followed, else there is an import error when importing the Python library.

Since distutils is part of the standard library, I'm perfectly fine with this approach. However, given that there's already a build system in place for lttoolbox (probably nothing too unconventional -- https://github.com/apertium/lttoolbox/blob/master/Makefile.am), you might have to use the dynamic linking approach.

  • Searching the logs directed me to the SWIG interface file for HFST. This might be helpful.

This should serve as a great example. You might even want to copy its approach as much as possible. Take a look at the README: https://github.com/hfst/hfst/tree/master/python.

@sushain97
Member Author

Note that if we did end up wrapping a large portion of lttoolbox based on some desire for such functionality, we should release that separately on PyPI and then depend on it here in the apertium module.

@TinoDidriksen
Member

Regarding SWIG: The needed functions are the ones that https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc calls - grep for fstp. and LtLocale::. You shouldn't need to wrap more than that for now.

@sushain97
Member Author

sushain97 commented May 10, 2019

Makes sense. The MVP is wrapping every function that lt_proc.cc calls when invoked with -n. A quick look makes that seem like:

https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L111
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L230
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L245
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L270
https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L398-L400

Note that all the file operations can just be done in Python since that shims out to 'native' code anyway.

@singh-lokendra
Collaborator

  • Instead of creating a SWIG wrapper for every file, focus only on lt-proc.
  • The code is separated into different files, but there is no need to write a wrapper for all of them. I was unable to find anything relevant to this problem in the docs.
  • Hence I implemented a working example of the above problem.
  • Working example: foo() is defined in foo.cpp and calls bar(), defined in bar.cpp. Successfully implemented a wrapper for foo.cpp.
  • ToDo: check the above implementation for classes, where the class is defined in one file and the object is created in another. This would be required for class FSTProcessor.
  • All the examples in the documentation have the function declaration in a separate header file. I implemented a SWIG interface file for function declarations in the cpp file itself. Required for wrapping endProgram() and checkValidity().
  • While working on an example for wrapping main(), I ran into unexpected output. If the function is renamed to main2(), the desired output appears. Will look into this and report back.
  • All header files require an interface file. SWIG provides interface files for some standard libraries.
  • Of cstdlib, getopt, iostream and libgen, an interface file is available only for iostream. I will look into the usage of each header file and implement interface files for the required ones.

@sushain97
Member Author

sushain97 commented May 12, 2019

Working example: foo() is defined in foo.cpp and calls bar(), defined in bar.cpp. Successfully implemented a wrapper for foo.cpp.

Great!

Required for wrapping endProgram() and checkValidity()

There's no real need to wrap endProgram() or checkValidity(). You can just replicate what they do in Python. I suspect that if we wanted to wrap them, we would need to add function declarations for them to the header file.

While working on an example for wrapping main(), I ran into unexpected output. If the function is renamed to main2(), the desired output appears. Will look into this and report back.

Why are we wrapping main? It looks to me like the SWIG docs recommend against it, and I don't think it's needed either. The lines I noted above are really all that matters. We would need to replicate calls to them in Python, but that's fine (especially for an MVP).

All header files require an interface file. SWIG provides interface files for some standard libraries.

We don't need to actually call any code that hits getopt or libgen?

@TinoDidriksen
Member

The SWIG .i file only needs to list the functions we want exported. So just don't list things from cstdlib, getopt, etc. Basically, grep out the FST calls from lt_proc.cc and stuff those in a .i which includes the library headers in %{ %}.

The idea was never to wrap lt_proc.cc - it was to look in lt_proc.cc for what needs to be wrapped.
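A minimal interface file along those lines might look like the sketch below. The method signatures follow the 2019-era lttoolbox headers and should be checked against the installed ones; only the calls that lt_proc.cc makes are listed:

```swig
/* lttoolbox.i -- sketch of a minimal interface */
%module lttoolbox

%{
/* compiled into the generated wrapper code itself */
#include <lttoolbox/fst_processor.h>
#include <lttoolbox/lt_locale.h>
%}

/* only the declarations listed below are exported to Python */
class LtLocale {
public:
  static void tryToSetLocale();
};

class FSTProcessor {
public:
  void load(FILE *input);
  void initAnalysis();
  bool valid() const;
  void setDictionaryCaseMode(bool value);
  void analysis(FILE *input, FILE *output);
};
```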

@sushain97
Member Author

@Vaydheesh any luck? Tino's advice should be helpful.

@singh-lokendra
Collaborator

I had an exam on the 14th and my next exam is tomorrow, so I couldn't look into it. I'll try this after the exam.

@sushain97
Member Author

Sounds good. Best of luck on your exams!

@singh-lokendra
Collaborator

  • Working on a wrapper that replicates the bare functionality of lt-proc -w.
  • While linking the object files into the shared library, flags for libxml are needed.
  • Demangled undefined symbols in the shared library using c++filt.
  • Resolved various undefined symbols in the shared library using nm -Cu.

@sushain97
Member Author

sushain97 commented May 19, 2019

Hope your exams went well. That sounds like a good start!

I'd like to see a PR soon (maybe tomorrow) even if it's a WIP. Tino and/or I can take a quick look to double check that you're not going down the wrong path.

edit: note that everything should probably go in a Makefile, a la https://github.com/hfst/hfst/blob/master/python/Makefile.am

@sushain97
Member Author

@Vaydheesh how's it going?

@singh-lokendra
Collaborator

  • I've implemented the file descriptor approach.
  • When I call init_analysis() on existing files, with the text to be analysed already present in the input file, I get the desired output.
  • The modes used for the input and output files are r and w, respectively.
  • If I create a new file, write the text to be analysed to input_file and call init_analysis(), the output is not flushed to the output file.
  • I've tried the r+ and w+ modes for the input file, and w, r+ and w+ for the output file.
  • Trying to fix this issue by isolating the function calls.

Also,

  • Is it possible to get the mode in which a file stream is opened in C++? The Python equivalent is open('foo.txt').mode.

@sushain97
Member Author

I'm a bit confused. You should be using tempfile to create new temporary output and input files for each call. They should never be reused (at least for now). If https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryFile isn't working well for you, try https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile.

@singh-lokendra
Collaborator

create new temporary output and input files for each call.

I am using separate files for input and output.

  • I started my work with tempfile, but that didn't work.
  • Hence I tried with this, but it didn't work either.
  • If I create a new file, write the text to be analysed to input_file and call init_analysis(), the output is not flushed to the output file.
  • Only then did I try using existing files: when I call init_analysis() on existing files, with the text to be analysed already present, I get the desired output.
  • Now I am trying to find the cause by brute force.
  • I'll implement temporary files in the actual code.

@sushain97
Member Author

sushain97 commented May 23, 2019

Hmm, okay. Let me know if I can help (what's going wrong?). Something straightforward like

with tempfile.NamedTemporaryFile('w') as input_file, tempfile.NamedTemporaryFile('r') as output_file:
    input_file.write(input_text)
    input_file.flush() # might not be needed
    init_analysis(automorf_path, input_file.name, output_file.name)
    output = output_file.read()

If you can get the above working, don't worry about file descriptors for now. I'd rather get something to work and then improve it than immediately trying to jump to the optimal approach.

@singh-lokendra
Collaborator

This is what I've run in the interpreter:

import tempfile
import analysis
with tempfile.NamedTemporaryFile('w') as input_file, tempfile.NamedTemporaryFile('r') as output_file:
    input_text = 'cats'
    automorf_path = "/usr/share/apertium/apertium-eng/eng.automorf.bin"
    input_file.write(input_text)
    input_file.flush() # might not be needed
    x = analysis.FST()
    x.init_analysis(automorf_path, input_file.name, output_file.name)
    output = output_file.read()
    print(output)
    input() # pause 

Output:
4 (the return value of input_file.write('cats'))

When I checked the temporary files created in /tmp using cat $filename:

  • One file has the text cats. file $filename prints /tmp/tmp5irc7w1t: ASCII text, with no line terminators
  • No output for the other file. file $filename prints /tmp/tmp704eeaz1: empty

@sushain97
Member Author

print(output) shouldn't print 4. It should print whatever is in the file... Very odd.

I'm not sure if this will help at all but try using tempfile.NamedTemporaryFile('r', delete=False).

@sushain97
Member Author

Also, instead of using input() to pause, do something like breakpoint(), then you can also inspect things.

@singh-lokendra
Collaborator

print(output) shouldn't print 4. It should print whatever is in the file... Very odd

The 4 is the output of input_file.write(input_text).
print(output) doesn't print anything.

@singh-lokendra
Collaborator

print(output) shouldn't print 4. It should print whatever is in the file... Very odd.

I'm not sure if this will help at all but try using tempfile.NamedTemporaryFile('r', delete=False).

The file persists, but it is empty

@sushain97
Member Author

Try adding some debugging to the C++ (or using gdb, I suppose) to figure out whether the issue is that no input is being read or that the output isn't being written. You can also try using https://en.cppreference.com/w/cpp/io/c/tmpfile and passing in/out strings. I don't like that approach since it requires more code in the C++ wrapper, but let's see if we can get something to work first.

@singh-lokendra
Collaborator

  • When I use the following code, input_text gets written to the text_in.txt file, but text_out.txt fills with the character ?, its size increasing continuously, so I have to kill the process:
void init_analysis(char *automorf_path, char *input_text, char output_text[50])
{
  setDictionaryCaseMode(true);
  LtLocale::tryToSetLocale();
  FILE *in = fopen(automorf_path, "rb");
  load(in);
  initAnalysis();
  checkValidity();
  FILE *input = fopen("text_in.txt", "wb+");
  FILE *output = fopen("text_out.txt", "wb+");
  fputs(input_text, input);
  rewind(input);    
  analysis(input, output);
  // fputs("foobar", output);
  rewind(output);    
  fgets(output_text, 50, output);
  fclose(in);
  fclose(input);
  fclose(output);
}
  • Upon replacing the analysis(input, output) call with fputs("foobar", output), the output file has the desired content.

@sushain97
Member Author

Ah, I thought you already had it working with non-temporary files. I'm not really sure why your example doesn't work. Maybe https://github.com/apertium/lttoolbox/blob/master/lttoolbox/lt_proc.cc#L308-L311? That shouldn't be relevant unless you're on Windows, though. You shouldn't need to open the files in append mode, but that shouldn't matter.

@TinoDidriksen @unhammer any ideas? Probably something obvious.

@unhammer
Member

What does the input contain? There are certain inputs that I've seen lead lt-proc into endless loops (typically when the text hasn't been through the deformatters).

@singh-lokendra
Collaborator

I've been working on the multiple-flags functionality and I've made the relevant changes in all the wrappers. I just need to update the changes in utils.py. Can I complete this and then move on to the next task?

@sushain97
Member Author

Sure! I'm a bit worried that you're just creating code that will go out of sync with the real CLI interface though.

@singh-lokendra
Collaborator

The wrapper can be updated by adding a case to the switch block (if the additional functionality is kept in another .cc file). It's a different story if something like apertium-tagger can be implemented, which handles the argument parsing in a function itself.

@singh-lokendra
Collaborator

Documentation could also be a good idea.

I am not familiar with the documentation process. Can you explain what needs to be done? Should I just update https://github.com/apertium/apertium-python/blob/master/README.md and https://github.com/apertium/apertium-python/tree/master/docs with the usage?

@sushain97
Member Author

The wrapper can be updated by adding a case to the switch block (if the additional functionality is kept in another .cc file).

Yes. But someone who updates the options will have to do so in two places now, right?

I am not well aware of the documentation process. Can you explain what needs to be done? Should I just update the https://github.com/apertium/apertium-python/blob/master/README.md and https://github.com/apertium/apertium-python/tree/master/docs with the usage?

Here are the docs that are generated: https://apertium-python.readthedocs.io/en/latest/. They're mostly generated from the doc comments, which need to be improved. The README could also use some improvement; for example, there are no docs about installing new packages.

@singh-lokendra
Collaborator

Yes. But someone who updates the options will have to do so in two places now, right?

If those changes need to be reflected in the wrapper, then yes.

Here are the docs that are generated: https://apertium-python.readthedocs.io/en/latest/. They're mostly generated from the doc comments, which need to be improved. The README could also use some improvement; for example, there are no docs about installing new packages.

👍

@sushain97
Member Author

If those changes need to be reflected in the wrapper, then yes.

Is there a way for us to ensure that these updates actually occur? e.g. in the build or test?

@sushain97
Member Author

Or even better, is there a way to refactor so there aren't two places at all?

@singh-lokendra
Collaborator

If the part that wraps the argument parsing can be avoided, then there won't be any need to maintain duplicates. Since all the parsing is done in main(), it's not possible to link its object file (multiple definitions of main would cause a linking error). If the parsing can be done in another function defined in another file, then that file can be linked into the shared library, something like https://github.com/apertium/apertium/blob/master/apertium/apertium_tagger.cc#L34 & https://github.com/apertium/apertium/blob/master/apertium/tagger.cc#L75

@singh-lokendra
Collaborator

I am working on caching the wrapper object to prevent reinitialization.

import lttoolbox
import os
import tempfile

def execute_lt_proc(fst, command, input_text, input_file, output_file):
    input_file.write(input_text)
    input_file.flush()
    fst.lt_proc(command, input_file.name, output_file.name)
    return output_file.read()

def analyze(input_text, fst):
    input_file = tempfile.NamedTemporaryFile(delete=False, mode='w')
    output_file = tempfile.NamedTemporaryFile(delete=False)
    command = ['lt-proc', '-w']
    output = execute_lt_proc(fst, command, input_text, input_file, output_file)
    os.remove(input_file.name)
    os.remove(output_file.name)
    print(output)

path = '/usr/share/apertium/apertium-eng/eng.automorf.bin'
fst = lttoolbox.FST(path)
analyze('cats\n', fst)

The output is b'^cats/cat<n><pl>$\n', but calling analyze again prints b'^cats/cat<n><pl>/cat<n><pl>$\n', and calling it a third time prints b'^cats/cat<n><pl>/cat<n><pl>/cat<n><pl>$\n'.

Re-initializing the object fixes this. I have no clue about this unexpected behavior. Is it even possible to reuse the FSTProcessor?

@sushain97
Member Author

Hmm... I'm not sure. @unhammer, @xavivars any thoughts? related code: https://github.com/apertium/lttoolbox/blob/master/python/lttoolbox.i#L22

@unhammer
Member

That certainly is unexpected :-) I've used lttoolbox's biltrans from Python (with a bare-bones C call, ages ago) without problems; it should have worked.

What happens if you destxt the input, i.e. send 'cats[][\n]' as input thrice? FSTProcessor::analysis isn't meant to be called on undeformatted input, and newlines typically get put in superblanks. (Could it be that FSTProcessor::analysis keeps analyses in some global variable that is only reset on, say, a superblank, or something like that?)

@singh-lokendra
Collaborator

  • The previous output is carried over into the next call:
>>> analyze('cats[][\n]', fst)
b'^cats/cat<n><pl>$[][\n]'
>>> analyze('cats[][\n]', fst)
b'^cats/cat<n><pl>/cat<n><pl>$[][\n]'
>>> analyze('cats[][\n]', fst)
b'^cats/cat<n><pl>/cat<n><pl>/cat<n><pl>$[][\n]'

@unhammer
Member

Very strange. What exactly does fst.lt_proc call in the C++ code? Does it do initAnalysis each time before analysis, or just analysis? (initAnalysis seems to add extra paths to the fst)

@singh-lokendra
Collaborator

initAnalysis each time before analysis, or just analysis

It calls initAnalysis each time before analysis.

(initAnalysis seems to add extra paths to the fst)

Calling initAnalysis only once fixes this issue. Here is the equivalent C++ code: https://termbin.com/irdm

@unhammer
Member

If you look at what initAnalysis does, it makes sense that you'd end up with duplicate paths when calling it several times.

However, this complicates things a bit: you can't simply do fst = lttoolbox.FST(path) and then, once, do fst.initAnalysis() and then fst.analysis(string), because what if the user then does initGeneration and wants to try what happens with generation, etc.?

Maybe you could make the API something like fsta = lttoolbox.FST_for_analysis(path) and fstg = lttoolbox.FST_for_generation(path), where FST_for_analysis calls initAnalysis before returning? Or what do you think @sushain97?

@sushain97
Member Author

Well, given that we're not trying to expose a perfect lttoolbox API for Python, I'm fine with the latter option. It should be possible in the current setup.
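A rough Python sketch of that API shape follows; the FST class here is a stand-in stub for whatever the SWIG wrapper actually exposes, and init_analysis/init_generation are hypothetical method names:

```python
class FST:
    """Stand-in stub for the SWIG-wrapped lttoolbox FST (hypothetical API)."""

    def __init__(self, path):
        self.path = path
        self.mode = None

    def init_analysis(self):
        # the real wrapper would call FSTProcessor::initAnalysis() here;
        # calling it more than once duplicates paths in the FST
        self.mode = 'analysis'

    def init_generation(self):
        self.mode = 'generation'


def fst_for_analysis(path):
    """Build an FST and run its analysis initialization exactly once."""
    fst = FST(path)
    fst.init_analysis()
    return fst


def fst_for_generation(path):
    """Build an FST and run its generation initialization exactly once."""
    fst = FST(path)
    fst.init_generation()
    return fst
```

The caller keeps the analysis and generation objects separate, so neither is ever re-initialized.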

@singh-lokendra
Collaborator

singh-lokendra commented Aug 22, 2019

After getting familiar with typemaps, I tried wrapping FILE *. Right now there are some issues with using multiple-argument typemaps; it is possible to wrap FILE * without depending on PyFile_AsFile #42 (comment). Using FILE * would allow a thin wrapper over the C++ API calls, and the required init calls as well as other function calls can be implemented in the Python super-wrapper. Below is an MVP for this approach:

@sushain97
Member Author

That seems okay to me if it works but I'm not sure how it's related to the stuff above?

@singh-lokendra
Collaborator

Maybe you could make the API be something like fsta = lttoolbox.FST_for_analysis(path) and fstg = lttoolbox.FST_for_generation(path) where FST_for_analysis calls initAnalysis before returning it?

By wrapping FILE *, separate init calls for every argument won't be required.

@sushain97
Member Author

Hmm, I'm not seeing why what Unhammer suggested isn't possible via the current approach of passing in temporary path strings.

@singh-lokendra
Collaborator

Can you have a look at http://wiki.apertium.org/wiki/User:Vaydheesh/GSoC2019Report, to be submitted as the GSoC report?

@sushain97
Member Author

It LGTM! I'm not too worried about the GSoC report :) I'd prefer you spend the last bits of time on the PRs.

@singh-lokendra
Collaborator

https://github.com/apertium/apertium-python/blob/master/apertium/utils.py#L191

Is this TODO about installing language packages on the system?

@sushain97
Member Author

sushain97 commented Aug 29, 2019

That TODO is actually copied from apertium-apy. I think the concern is that we're parsing .modes files instead of the source modes.xml files. I don't know if the latter are available in an installation from the package manager?

@TinoDidriksen
Member

modes.xml is not installed. The worry of "what if a path has | or ' in it?" is technically possible, but highly unlikely to happen - it would break many things, so people know not to be that degenerate.
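The quoting concern arises when parsing the shell pipelines in the installed .mode files. A small sketch of how Python's shlex differs from naive whitespace splitting (the mode line here is made up for illustration):

```python
import shlex

# An illustrative pipeline fragment from a .mode file, with a quoted path
mode_line = "lt-proc -w '/usr/share/apertium/apertium-eng/eng.automorf.bin'"

naive = mode_line.split()      # keeps the single quotes inside the path
args = shlex.split(mode_line)  # strips quoting the way a shell would

print(args)
```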


@sushain97
Member Author

I don't think we need to do anything on that topic until/if people complain.

@singh-lokendra
Collaborator

singh-lokendra commented Sep 3, 2019

I am working on the various open issues, but suddenly the AppVeyor builds started failing due to the generator tests. It seems some unwanted characters are still present in the output. No changes were made to the generator code, and the Travis builds are passing: https://ci.appveyor.com/project/apertium/apertium-python/builds/27152942/job/u7iel6ljkqlm683v#L234

python -c "import apertium; print(apertium.generate('en', '^cat<n><pl>$'))"
[2019-09-03 16:43:08,461] {C:\projects\apertium-python\apertium\utils.py:173} WARNING - Calling subprocess lt-proc
#\^cat

Edit: this weird behavior was caused by a unit test. I've fixed the AppVeyor builds.

@sushain97
Member Author

Edit: this weird behavior was caused by a unit test. I've fixed the AppVeyor builds.

Awesome!
