Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in parse_string? (or misleading documentation?) #45

Open
samoconnor opened this issue Mar 13, 2016 · 19 comments
Open

Memory leak in parse_string? (or misleading documentation?) #45

samoconnor opened this issue Mar 13, 2016 · 19 comments

Comments

@samoconnor
Copy link

I received a report of memory leaks from an XMLDict user JuliaCloud/XMLDict.jl#1.

I'm somewhat surprised. The LightXML.jl doc says, under the heading "Create an XML Document", that "When you create XML documents and elements directly you need to take care not to leak memory". However, under the heading "Read an XML file" there is no mention of memory management. XMLDict.jl only calls LightXML.parse_string, so it does not "create XML documents and elements directly".

The documentation seems to say that if you are just "read[ing] an xml file", as opposed to "create[ing] XML documents and elements directly", then you do not need to worry about memory management.

Is parse_string broken? or is the documentation broken?

This seems related: #19

@tkelman
Copy link
Contributor

tkelman commented Mar 13, 2016

https://github.com/JuliaLang/LightXML.jl/blob/d31f1f43dc04fd1546ec5e4ad039c6f1981892fd/src/document.jl#L103
XMLDocument(p) is creating an XML document. So the word "directly" should probably be adjusted. If you know you're never adding children or other references to the same underlying pointer you can manually add a finalizer to any XMLDocuments you create. I don't think it's safe to do that in general without bookkeeping reference counts in all of the Julia types.

@samoconnor
Copy link
Author

I suggest at a minimum:

  • Add a finalizer in the parse_* methods
  • Describe in the documentation: under what circumstances users of parse_* could still get into trouble with memory management.
  • Provide an example of "this is ok", and an example of "this is not ok", and "here is how to handle it".

@tkelman
Copy link
Contributor

tkelman commented Mar 13, 2016

My concern is if you do parse_string to create an initial XMLDocument, then add some children to it, if it has a finalizer and the Julia XMLDocument object goes out of scope and gets garbage collected but the children are still alive, you could cause a segfault since the Julia GC doesn't know about the C-level reference between the two objects.

@samoconnor
Copy link
Author

So, per my suggestion above, please describe that in the documentation.
i.e. say that everything is fine if...

  • you don't manually add children.
  • when you do add children, you make sure to hold a reference to the parent doc until after you're done with the children.

And, please provide examples of correct usage in the documentation.

(FWIW, a segfault is far preferable to a memory leak. A segfault gets found in testing. A memory leak causes stuff to break months later in production and costs way more to fix.)

@yuyichao
Copy link

A segfault caused by a GC will not necessarily be found in tests.

@yuyichao
Copy link

I'm not familiar with libxml2 API but is the memory own'ed by a root object and is it always possible to find know what the owner object is when the library returns a pointer (either from the argument or from some libxml2 api)?

@tkelman
Copy link
Contributor

tkelman commented Mar 13, 2016

I don't know exactly how to answer your question @yuyichao. libxml2 is a C library and until someone implements a full reference count system here (which is what I believe the R wrapper and maybe also the Python wrapper do), you're responsible for C-like manual memory management when using it through this wrapper.

@yuyichao
Copy link

What I mean is that we don't necessarily need to keep track of reference counting and we can instead track the reference directly. This may not be simpler in general (although also won't be much more complicated since for RC you also need to lookup the julia object corresponding to a libxml2 pointer) but in the case where each object has a clear owner this would be much simpler.

My question was basically if this is true for libxml2. For example, can a xml node belong to multiple xml document simultaneously? And if this is reasonable to keep track of given the API. (i.e. if it is clear from the API when and how the ownership is transfered.)

@tkelman
Copy link
Contributor

tkelman commented Mar 14, 2016

can a xml node belong to multiple xml document simultaneously

Looks like.

julia> xdoc = XMLDocument()
<?xml version="1.0" encoding="utf-8"?>
julia> xroot = create_root(xdoc, "States")
<States/>
julia> xs1 = new_child(xroot, "State")
<State/>
julia> xdoc2 = XMLDocument()
<?xml version="1.0" encoding="utf-8"?>
julia> xdoc
<?xml version="1.0" encoding="utf-8"?>
<States>
  <State/>
</States>
julia> xroot2 = create_root(xdoc2, "States2")
<States2/>
julia> add_child(xroot2, xs1)
true
julia> xroot2
<States2>
  <State/>
</States2>
julia> xs2 = new_child(xs1, "StateChild2")
<StateChild2/>
julia> xdoc
<?xml version="1.0" encoding="utf-8"?>
<States>
  <State>
    <StateChild2/>
  </State>
</States>
julia> xdoc2
<?xml version="1.0" encoding="utf-8"?>
<States2>
  <State>
    <StateChild2/>
  </State>
</States2>

@yuyichao
Copy link

Actually it looks like this is not allowed since freeing both xdoc and xdoc2 here causes a segfault.

julia> using LightXML

julia> xdoc = XMLDocument()
<?xml version="1.0" encoding="utf-8"?>


julia> xroot = create_root(xdoc, "States")
<States/>

julia> xs1 = new_child(xroot, "State")
<State/>

julia> xdoc2 = XMLDocument()
<?xml version="1.0" encoding="utf-8"?>


julia> xroot2 = create_root(xdoc2, "States2")
<States2/>

julia> add_child(xroot2, xs1)
true

julia> xroot2
<States2>
  <State/>
</States2>

julia> xroot
<States>
  <State/>
</States>

julia> fre
free  frexp
julia> free
free (generic function with 4 methods)

julia> free(xdoc)
Ptr{Void} @0x0000000000000000

julia> free(xdoc2)

signal (11): 段错误
while loading no file, in expression starting on line 0
xmlFreeNodeList at /usr/bin/../lib/libxml2.so (unknown line)
xmlFreeNodeList at /usr/bin/../lib/libxml2.so (unknown line)
xmlFreeDoc at /usr/bin/../lib/libxml2.so (unknown line)
free at /home/yuyichao/projects/contrib/LightXML.jl/src/document.jl:71
unknown function (ip: 0x7fa6af99c045)
[inline] at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/julia_internal.h:69
jl_call_method_internal at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/gf.c:1879
do_call at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/interpreter.c:66
eval at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/interpreter.c:185
jl_toplevel_eval_flex at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/toplevel.c:561
[inline] at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/julia.h:1379
jl_eh_restore_state at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/builtins.c:548
eval at ./boot.jl:267
[inline] at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/julia_internal.h:69
jl_call_method_internal at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/gf.c:1879
[inline] at ./REPL.jl:3
eval_user_input at ./REPL.jl:62
unknown function (ip: 0x7fa6afbca5b7)
[inline] at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/julia_internal.h:69
jl_call_method_internal at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/gf.c:1879
[inline] at ./REPL.jl:92
#1 at ./task.jl:59
unknown function (ip: 0x7fa8b80c1560)
[inline] at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/julia_internal.h:69
jl_call_method_internal at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/gf.c:1903
start_task at /home/yuyichao/projects/arch-pkg/pkg/all/julia-git/src/julia/src/task.c:245
unknown function (ip: (nil))
Allocations: 4859550 (Pool: 4857857; Big: 1693); GC: 10

@tkelman
Copy link
Contributor

tkelman commented Mar 14, 2016

That's expected, xmlFreeDoc frees all children http://xmlsoft.org/html/libxml-tree.html#xmlFreeDoc

@yuyichao
Copy link

Then is there a way to free all the memory allocated without segfault after creating this?

@yuyichao
Copy link

xmlNode for example also have a parent field so I think it is not supposed to be added to two trees simultaneously.

@tkelman
Copy link
Contributor

tkelman commented Mar 14, 2016

It might be one of those things that the library lets you do but you're not supposed to do it? If we want to do something clever and automatic here, I think we'd really have to carefully study how some other proven, well-tested wrapper of libgit2 libxml2 for some garbage collected language handles these things.

@yuyichao
Copy link

IIUC lxml.etree uses libxml2 and when you try to add a node to another tree, it is deleted from the original one.

In [1]: from lxml import etree

In [2]: root = etree.Element("root")

In [3]: root2 = etree.Element("root2")

In [4]: c1 = etree.Element("child1")

In [5]: root.append(c1)

In [6]: print(etree.tostring(root).decode())
<root><child1/></root>

In [7]: print(etree.tostring(root2).decode())
<root2/>

In [8]: root2.append(c1)

In [9]: print(etree.tostring(root).decode())
<root/>

In [10]: print(etree.tostring(root2).decode())
<root2><child1/></root2>

@nalimilan
Copy link
Member

FWIW, here's a detailed description of how this works in the R wrapper for libxml2:
https://github.com/omegahat/XML/blob/master/Docs/MemoryManagement.pdf

@samoconnor
Copy link
Author

In XMLDict, I've added finalizer(doc, LightXML.free) after parse_string, and finalize(doc) after doc has been converted to a Dict. JuliaCloud/XMLDict.jl@146af78

The user who reported JuliaCloud/XMLDict.jl#1 says that they can now process large (6 gb) files with constant memory usage.

I'd like to reiterate my request to the maintainers of LightXML.jl: Please add, ASAP, a description and and example to the documentation for the simple read-only case that says "if you do x, y and z you'll be fine". I understand that it might take time to figure out a robust way to handle all the corner cases, but in the meantime, the documentation should at least tell the user how to use the library safely in its current form.

@tkelman
Copy link
Contributor

tkelman commented Mar 15, 2016

Pull requests encouraged.

@samoconnor
Copy link
Author

FYI I made a further tweak to the XMLDict.jl work-around: JuliaCloud/XMLDict.jl@d7b63d0
-- "keep back refernce to top level XMLDocument object to avoid premature finalisation".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants