Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unfound links #8

Open
RuyBlast opened this issue Aug 19, 2023 · 2 comments
Open

unfound links #8

RuyBlast opened this issue Aug 19, 2023 · 2 comments

Comments

@RuyBlast
Copy link

RuyBlast commented Aug 19, 2023

Hi, I had troubles with unfound links so I built a minimal example to reproduce the bug:

ancors.ml

module S = Soup
module M = Mechaml

let url = "https://fr.wikipedia.org/wiki/OCaml"

module Debug = struct
  let pretty_print page css_selector =
    let soup = M.Page.soup page in
    let node = S.R.select_one css_selector soup in
    print_endline (S.pretty_print node)
end

let action () =
  let open M.Agent.Monad.Infix in
  M.Agent.get url
  >|= (fun http_res ->
    let page = M.Agent.HttpResponse.page http_res in
    let infobox_selector = ".infobox_v2 > tbody" in

    (* select all descendants of the infobox *)
    let link_selector = Printf.sprintf "%s *" infobox_selector in

    (* checks that selector find first child element of infobox *)
    Debug.pretty_print page link_selector;

    (* is supposed to collect all links in the infobox *)
    let link_seq = M.Page.links_with link_selector page in
    let l = M.Page.to_list link_seq in

    (* prints 0, while there is ~25 links in infobox *)
    print_int (List.length l);
    print_newline ())

let () = snd (M.Agent.Monad.run (M.Agent.init ()) (action ()))

dune

(executable (name ancors)
 (libraries lambdasoup mechaml))

My version of Mechamel is 1.2.1.
I do not know what is the problem, maybe relative links ? I hope I did not make a trivial usage mistake of the library though. As a supplementary information, if I try to select all links of this page, I get about a list of length ~500.

Best,

@RuyBlast
Copy link
Author

Hey again, I have investigated a bit and found the source of the problem, inside page.ml :

let tag_selector tag = function
  | "" -> tag
  | s when s.[0]='*' -> s
  | s when is_identifier_char s.[0] -> s
  | s -> tag^s

(* ... *)

let links_with selector p =
  p.soup
  |> Soup.select (tag_selector "a" selector)
  |> Soup.filter (tag_filter "a")
  |> seq_from_nodes (fun elt -> Link.( {resolver = resolver p; elt} ))

I see the use-case that was in mind when developing this was feeding links_with with a css attribute selector (aka surrounded by "[]"). Unfortunately, it does not work at all when we want to specify that the links we are looking for are descendants of elements with class .x or id #y.

A workaround is prefixing the css selector we give to links_with (and other _with functions) with "html ", as an example.

To fix this issue, I suggest either:

  • to add these two lines to tag_selector function:
  | s when s.[0]='#' -> s
  | s when s.[0]='.' -> s
  • or to simply suppress the tag_selector function and just let people write right their css selector (and eventually document it is on the user to provide the complete css selector for elements to match) to match elements they want.

I have a preference for the second solution because it is more predictable for users, and easy to fix. The first is even simpler, but maybe not exhaustive (now or in the future). Please tell me what you prefer so that I can make a PR accordingly ;)

@yannham
Copy link
Owner

yannham commented Sep 6, 2023

Hi @RuyBlast, thanks for the investigation! I believe the second solution is unfortunately backward incompatible. So, for now, I would vouch for the first one, which is simple and doesn't change current behavior. This lets us time to think about the second one 🙂 (maybe we can just add an alternative function to do so, links_with_no_tag_selector_but_obviously_a_better_name). Will happily review your PR and make a new minor release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants