Tie Breaker Error #134

0xSheller · 2024-07-11T02:07:17Z

I've been rooting out this problem for a few hours now and I've finally figured out where it's stemming from after concluding it's not in my implementation/code base.

Issue:
When detecting dialects, if two or more dialects have exactly the same score, the tie breaker returns None.

See:

SimpleDialect('\t', '', ''):    P =   295149.000000     T =        0.997842     Q =   294512.000000
SimpleDialect('\t', '"', ''):   P =   128607.250000     skip.
SimpleDialect('\t', '"', '\\'): P =   295149.000000     T =        0.997842     Q =   294512.000000

This specifically yields 2 dialects that go into the break_ties_two function. I'm unsure if this affects the other tiebreakers:

break_ties_three
break_ties_four

Modifying the detect function to print results:

    def detect(
        self, data: str, delimiters: Optional[List[str]] = None
    ) -> Optional[SimpleDialect]:
        """Detect the dialect using the consistency measure

        Parameters
        ----------
        data : str
            The data of the file as a string

        delimiters : iterable
            List of delimiters to consider. If None, the :func:`get_delimiters`
            function is used to automatically detect this (as described in the
            paper).

        Returns
        -------
        dialect : SimpleDialect
            The detected dialect. If no dialect could be detected, returns None.

        """
        self._cached_is_known_type.cache_clear()

        # TODO: probably some optimization there too
        dialects = get_dialects(data, delimiters=delimiters)

        # TODO: This is not thread-safe and this object can simply own a Parser
        # for each dialect and set the limit directly there (we can also cache
        # the best parsing result)
        old_limit = field_size_limit(len(data) + 1)

        scores = self.compute_consistency_scores(data, dialects)
        best_dialects = ConsistencyDetector.get_best_dialects(scores)
        result: Optional[SimpleDialect] = None
        if len(best_dialects) == 1:
            result = best_dialects[0]
        else:
            print(len(best_dialects))  # << Here
            result = tie_breaker(data, best_dialects)
            print(type(result)) # << Here
            print(result) # << Here

        field_size_limit(old_limit)
        return result

Yields:

2
<class 'NoneType'>
None

This tells me the tie breaker isn't doing its job.

Further Analysis:

Ambiguous Tie Breakers: The function relies on several specific conditions to break ties between dialects, such as differences in quotechar, delimiter, or escapechar. If these conditions are not met, or if the distinctions are not sufficient to determine a clear winner (i.e., none of the predefined conditions apply or the conditions apply but lead to a tie), the function defaults to returning None.
Incomplete Handling for Non-Specific Cases: The current implementation does not cover cases where dialects differ in ways that are not specified or when the parsing results in identical outputs for the specific attributes being compared.
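
One way two candidates can end up with exactly the same score is that they parse the data into the same rows. The sketch below illustrates this with Python's standard csv module, not CleverCSV's own parser, on made-up sample data that contains no quote or backslash characters. The `parse` helper and `SAMPLE` are hypothetical names used only for illustration.

```python
import csv
import io

# Made-up sample (an assumption): tab-separated, no quotes, no backslashes.
SAMPLE = "id\tname\n1\talpha\n2\tbeta\n"


def parse(data, delimiter, quotechar, escapechar):
    """Parse data with the stdlib csv module; '' means 'not used'."""
    kwargs = {"delimiter": delimiter, "escapechar": escapechar or None}
    if quotechar:
        kwargs["quotechar"] = quotechar
    else:
        # No quote character: tell the reader to ignore quoting entirely.
        kwargs["quoting"] = csv.QUOTE_NONE
    return list(csv.reader(io.StringIO(data), **kwargs))


# The two dialects that tied in the log above.
rows_plain = parse(SAMPLE, "\t", "", "")
rows_quoted = parse(SAMPLE, "\t", '"', "\\")

# Both dialects yield the same rows, so nothing in the parse result
# alone can separate them.
print(rows_plain == rows_quoted)
```

When the parsed rows are identical, any rule that compares parse results has nothing to distinguish the candidates with, which is consistent with the tie breaker falling through to None.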

ws-garcia · 2024-10-16T17:56:17Z

As the research paper points out: "If it is not possible to break the tie reliably, our method returns no result. In a practical setting, this is preferred over returning an incorrect result."

So, there are no issues with CleverCSV on this subject. The tool is working as expected.

0xSheller · 2024-10-23T00:16:09Z

In what scenario would a None response be better?

In my use case, the tie breaker fails when there are two dialects, one with and one without quotation marks, and the difference doesn't matter. A better approach would be to return whichever has the highest score, both, or one.
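
A minimal sketch of the kind of fallback described here, written on the caller's side rather than inside CleverCSV. Everything in it is an assumption for illustration: dialects are modeled as plain `(delimiter, quotechar, escapechar)` tuples, `resolve` and `simplicity` are hypothetical helpers, and the tie breaker is passed in as any callable that may return None.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical stand-in for a dialect: (delimiter, quotechar, escapechar).
Dialect = Tuple[str, str, str]


def simplicity(dialect: Dialect) -> int:
    """Count how many of quotechar/escapechar are set (fewer is simpler)."""
    _, quotechar, escapechar = dialect
    return int(bool(quotechar)) + int(bool(escapechar))


def resolve(
    candidates: Sequence[Dialect],
    tie_breaker: Callable[[Sequence[Dialect]], Optional[Dialect]],
) -> Optional[Dialect]:
    """Return the tie breaker's choice, else the simplest tied candidate."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    winner = tie_breaker(candidates)
    if winner is not None:
        return winner
    # Fallback policy (an assumption): prefer the dialect with the fewest
    # special characters; min() keeps the first one if several are equal.
    return min(candidates, key=simplicity)


# The two tied dialects from the log, with a tie breaker that gives up.
tied = [("\t", "", ""), ("\t", '"', "\\")]
chosen = resolve(tied, tie_breaker=lambda cands: None)
print(chosen)
```

This deliberately gives up the guarantee quoted from the paper: it always returns something for a non-empty candidate list, so it only suits cases where the tied dialects are known to be interchangeable for the data at hand.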

ws-garcia · 2024-10-23T00:20:45Z

If you are running CleverCSV in verbose mode, you can see which dialects have the highest scores. At this point, human intervention is needed, so manually select the one that suits your needs.

0xSheller · 2024-10-23T00:21:56Z

Gotcha. That seems counterproductive for my use case specifically. I still think that in such cases the response should be a list of possible dialects rather than None. Anyway, I'm marking this as closed, since there doesn't appear to be much interest here besides mine.
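
As a rough illustration of the "list of possible dialects" idea, the sketch below returns both a single dialect (when there is a unique best score) and the full set of tied candidates. It is not CleverCSV's API: `Detection` and `detect_with_candidates` are hypothetical names, dialects are modeled as plain tuples, the tie-breaking stage is left out entirely, and the scores are the Q values from the log above, with the skipped dialect represented as None.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for a dialect: (delimiter, quotechar, escapechar).
Dialect = Tuple[str, str, str]


@dataclass(frozen=True)
class Detection:
    dialect: Optional[Dialect]        # set only when there is a unique best
    candidates: Tuple[Dialect, ...]   # every dialect sharing the best score


def detect_with_candidates(scores: Dict[Dialect, Optional[float]]) -> Detection:
    """Collect all dialects that share the highest score."""
    # Dialects that were skipped have no score and are ignored.
    scored = {d: s for d, s in scores.items() if s is not None}
    if not scored:
        return Detection(None, ())
    best = max(scored.values())
    tied = tuple(d for d, s in scored.items() if s == best)
    return Detection(tied[0] if len(tied) == 1 else None, tied)


scores = {
    ("\t", "", ""): 294512.0,
    ("\t", '"', ""): None,  # reported as "skip." in the log
    ("\t", '"', "\\"): 294512.0,
}
result = detect_with_candidates(scores)
print(result.dialect, len(result.candidates))
```

A caller can then treat `result.dialect is None` with a non-empty `result.candidates` as "ambiguous, choose one", which keeps the cautious default while still exposing the options.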

ws-garcia · 2024-10-23T00:26:29Z

I have developed a mechanism for dialect detection in Python. The result is always a dialect: the first one with the highest score, so the user's predefined ordering of dialects takes precedence. Feel free to try it. https://content.iospress.com/articles/data-science/ds240062

0xSheller · 2024-10-23T02:06:04Z

Will do! Thank you, it's an interesting read so far.

ws-garcia · 2024-10-23T02:12:12Z

@0xSheller, here is the GitHub repo.

0xSheller added a commit to 0xSheller/CleverCSV that referenced this issue Jul 11, 2024

Update break_ties.py

3dbd34a

Fix alan-turing-institute#134

0xSheller mentioned this issue Jul 11, 2024

Update break_ties.py #135

Open

0xSheller closed this as completed Oct 23, 2024
