Allow process.detect_types to match last type instead of the first #34

Open
SteadBytes opened this issue Aug 7, 2020 · 2 comments
@SteadBytes

Float values with zero as the fractional component, e.g. '0.0', '1.0', '1.00', are detected as int instead of float. This is because they can be parsed as int according to fntools.is_int. Although the data could be interpreted as integers, the source has a decimal place, so I would argue that detect_types should not cast to an integer. For example, database reports may include data from float/decimal columns which, just by chance, have no fractional component; that doesn't mean they should not be treated as floats.

Example Test Case

diff --git a/tests/test_process.py b/tests/test_process.py
index cc538a2..9b720e5 100644
--- a/tests/test_process.py
+++ b/tests/test_process.py
@@ -76,6 +76,12 @@ class Test:
         nt.assert_equal(Decimal('0.87'), result['confidence'])
         nt.assert_false(result['accurate'])
 
+    def test_detect_types_floats_zero_fractional_component(self):
+        records = it.cycle([{"foo": '0.0'}, {"foo": "1.0"}, {"foo": "10.00"}])
+        records, result = pr.detect_types(records)
+
+        nt.assert_equal(result["types"], [{"id": "foo", "type": "float"}])
+
     def test_fillempty(self):
         records = [
             {'a': '1', 'b': '27', 'c': ''},

Fails with:

AssertionError: Lists differ: [{'id': 'foo', 'type': 'int'}] != [{'id': 'foo', 'type': 'float'}]

First differing element 0:
{'id': 'foo', 'type': 'int'}
{'id': 'foo', 'type': 'float'}

- [{'id': 'foo', 'type': 'int'}]
?                         ^^

+ [{'id': 'foo', 'type': 'float'}]
?                         ^^^^

Potential Solutions

  • Prefer stricter type inference for floats by default; e.g. if it has decimal places, it's a float.
  • Allow stricter type inference for floats via an option, e.g. a kwarg to detect_types that is passed down to is_int to change the behaviour from "can this be parsed as an int" to "this is definitely an int".
  • Any other ideas of course!
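To make the difference between the two behaviours concrete, here is a minimal sketch. The names `can_parse_as_int` and `is_strict_int` are hypothetical and this is not how fntools.is_int is implemented; it only illustrates "can this be parsed as an int" versus "this is definitely an int".

```python
def can_parse_as_int(value):
    """Lenient check (hypothetical): True if the value is a whole number,
    even when written with a decimal point, e.g. '1.0'."""
    try:
        return float(value).is_integer()
    except (TypeError, ValueError):
        return False


def is_strict_int(value):
    """Strict check (hypothetical): True only if the text has integer
    syntax, so anything containing a decimal point is rejected."""
    try:
        int(str(value).strip())
    except ValueError:
        return False
    return True


for text in ["10", "1.0", "10.00", "0.1"]:
    print(text, can_parse_as_int(text), is_strict_int(text))
```

With the lenient check, '1.0' and '10.00' count as ints; with the strict check, only '10' does.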

I am happy to implement the changes required after a decision is made on the correct behaviour 😄

SteadBytes changed the title from "Floats with zero fractional component are detected as ints during type detection" to "Floats with zero fractional component are detected as ints by detect_types" on Aug 7, 2020
SteadBytes changed the title from "Floats with zero fractional component are detected as ints by detect_types" to "Floats with zero fractional component are detected as ints by process.detect_types" on Aug 7, 2020
@reubano
Owner

reubano commented Dec 25, 2021

Interesting.... so by default, detect_types needs 17 records to reach 95% confidence.

meza/meza/process.py

Lines 278 to 283 in 524b2fd

detect_types(records, 0.9, 3)['count'] == 23
detect_types(records, 0.9, 4)['count'] == 10
detect_types(records, 0.9, 5)['count'] == 6
detect_types(records, 0.95, 5)['count'] == 31
detect_types(records, 0.95, 6)['count'] == 17
detect_types(records, 0.95, 7)['count'] == 11

If you have 17 floats with .0 then I'm pretty tempted to say you have an int more likely than not.

The general issue is what to do whenever content matches a guess_func higher in the list when it could also match one lower down. The algorithm matches the first type found.

meza/meza/process.py

Lines 264 to 266 in 524b2fd

def detect_types(records, min_conf=0.95, hweight=6, max_iter=100):
    """Detects record types by selecting the first type which reaches the
    minimum confidence level (based on number of hits).

meza/meza/typetools.py

Lines 160 to 169 in 524b2fd

guess_funcs = [
    {"type": "null", "func": null_func},
    {"type": "bool", "func": ft.is_bool},
    {"type": "int", "func": int_func},
    {"type": "float", "func": float_func},
    {"type": "datetime", "func": is_datetime},
    {"type": "time", "func": is_time},
    {"type": "date", "func": is_date},
    {"type": "text", "func": lambda x: hasattr(x, "lower")},
]

E.g.,

        >>> records = it.cycle(
        ...    [
        ...        {"int": 1, "float": "1.0"},
        ...        {"int": 1, "float": "10.00"},
        ...        {"int": 0, "float": "0.0"}
        ...    ]
        ... )
        >>> detect_types(records)[1]['types']
        [{'id': 'int', 'type': 'bool'}, {'id': 'float', 'type': 'int'}]
        >>> records = it.cycle(
        ...    [
        ...        {"int": 1, "float": "1.0"},
        ...        {"int": 1, "float": "10.00"},
        ...        {"int": 0, "float": "0.0"},
        ...        {"int": 2, "float": "0.1"}
        ...    ]
        ... )
        >>> detect_types(records)[1]['types']
        [{'id': 'int', 'type': 'int'}, {'id': 'float', 'type': 'float'}]
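The first-match behaviour in the doctest above can be imitated with a stripped-down sketch. Everything here is an assumption for illustration: the predicates are simplified stand-ins for meza's guess functions, `first_match` is a hypothetical helper, and the confidence calculation is left out entirely (a type must match every value).

```python
def looks_bool(value):
    # Simplified stand-in: only 0/1/true/false count as booleans.
    return str(value).strip().lower() in {"0", "1", "true", "false"}


def looks_int(value):
    # Lenient: any whole number counts, including '1.0'.
    try:
        return float(value).is_integer()
    except (TypeError, ValueError):
        return False


def looks_float(value):
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


# Same relative order as the bool/int/float/text entries in guess_funcs.
GUESSERS = [
    ("bool", looks_bool),
    ("int", looks_int),
    ("float", looks_float),
    ("text", lambda v: isinstance(v, str)),
]


def first_match(values, guessers=GUESSERS):
    """Return the first type whose predicate accepts every value."""
    for name, func in guessers:
        if all(func(v) for v in values):
            return name
    return None


print(first_match(["1.0", "10.00", "0.0"]))         # int
print(first_match(["1.0", "10.00", "0.0", "0.1"]))  # float
print(first_match([1, 1, 0]))                       # bool
print(first_match([1, 1, 0, 2]))                    # int
```

Because "int" sits before "float" in the list, a column of whole-number floats never gets as far as the float check.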

@reubano
Owner

reubano commented Dec 25, 2021

One potential solution is an upcast kwarg that detect_types accepts which will cause it to match the last type found instead of the first.

detect_types(records, upcast=True)[1]['types']
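A rough sketch of the selection rule such a kwarg might switch between, not an implementation of detect_types. `pick_type` is a hypothetical helper, and `ORDER` simply copies the type order from the guess_funcs list in typetools.py.

```python
# Type names in the same order as guess_funcs.
ORDER = ["null", "bool", "int", "float", "datetime", "time", "date", "text"]


def pick_type(matching, upcast=False):
    """Choose one type from those that matched (hypothetical helper).

    upcast=False keeps the first match in ORDER; upcast=True takes the last.
    Names that are not in ORDER raise ValueError.
    """
    ranked = sorted(matching, key=ORDER.index)
    if not ranked:
        return None
    return ranked[-1] if upcast else ranked[0]


print(pick_type(["float", "int"]))               # int
print(pick_type(["float", "int"], upcast=True))  # float
print(pick_type(["int", "float", "text"], upcast=True))  # text
```

The last line hints at one design question: if the catch-all text check also matches string input, a plain "last match" would always end at text, so an upcast option would probably need to treat that entry separately.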

reubano changed the title from "Floats with zero fractional component are detected as ints by process.detect_types" to "Allow process.detect_types to match last type instead of the first" on Dec 25, 2021