Allow `process.detect_types` to match last type instead of the first #34

SteadBytes · 2020-08-07T13:13:02Z

Float values with zero as the fractional component, e.g. '0.0', '1.0', '1.00', are detected as int instead of float. This is because they can be parsed as int according to fntools.is_int. Although the data could be interpreted as an integer, given that the source has a decimal place I would argue that detect_types should not cast it to an integer. For example, database reports may include data from float/decimal columns which, just by chance, have no fractional component; that doesn't mean they should not be treated as floats.

Example Test Case

diff --git a/tests/test_process.py b/tests/test_process.py
index cc538a2..9b720e5 100644
--- a/tests/test_process.py
+++ b/tests/test_process.py
@@ -76,6 +76,12 @@ class Test:
         nt.assert_equal(Decimal('0.87'), result['confidence'])
         nt.assert_false(result['accurate'])
 
+    def test_detect_types_floats_zero_fractional_component(self):
+        records = it.cycle([{"foo": '0.0'}, {"foo": "1.0"}, {"foo": "10.00"}])
+        records, result = pr.detect_types(records)
+
+        nt.assert_equal(result["types"], [{"id": "foo", "type": "float"}])
+
     def test_fillempty(self):
         records = [
             {'a': '1', 'b': '27', 'c': ''},

Fails with:

AssertionError: Lists differ: [{'id': 'foo', 'type': 'int'}] != [{'id': 'foo', 'type': 'float'}]

First differing element 0:
{'id': 'foo', 'type': 'int'}
{'id': 'foo', 'type': 'float'}

- [{'id': 'foo', 'type': 'int'}]
?                         ^^

+ [{'id': 'foo', 'type': 'float'}]
?                         ^^^^

Potential Solutions

- Prefer stricter type inference for floats by default; e.g. if it has decimal places, it's a float.
- Allow stricter type inference for floats via an option, e.g. a kwarg to detect_types that is passed down to is_int to change the behaviour from "can this be parsed as an int" to "this is definitely an int".
- Any other ideas of course!
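To make the two behaviours in the second option concrete, here is a minimal sketch. Both function names are hypothetical and are not meza's actual `is_int`; the lenient version only assumes that "can be parsed as an int" means "parses to a whole number".

```python
def is_int_lenient(content):
    """Hypothetical: True if the value parses to a whole number."""
    try:
        return float(content).is_integer()
    except (TypeError, ValueError):
        return False


def is_int_strict(content):
    """Hypothetical: True only if the value is written as an integer."""
    if isinstance(content, bool):
        return False
    if isinstance(content, int):
        return True
    try:
        # int() rejects strings containing a decimal point, such as "1.0"
        int(str(content).strip())
    except ValueError:
        return False
    return True


values = ["0.0", "1.0", "10.00"]
print([is_int_lenient(v) for v in values])  # every value passes
print([is_int_strict(v) for v in values])   # no value passes
```

With the strict check, a column of `'0.0'`, `'1.0'`, `'10.00'` would fall through to the float check instead of stopping at int.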

I am happy to implement the changes required after a decision is made on the correct behaviour 😄


reubano · 2021-12-25T02:21:47Z

Interesting... so by default (min_conf=0.95, hweight=6), detect_types needs 17 records to reach 95% confidence.

meza/meza/process.py

Lines 278 to 283 in 524b2fd

    
    detect_types(records, 0.9, 3)['count'] == 23
    detect_types(records, 0.9, 4)['count'] == 10
    detect_types(records, 0.9, 5)['count'] == 6
    detect_types(records, 0.95, 5)['count'] == 31
    detect_types(records, 0.95, 6)['count'] == 17
    detect_types(records, 0.95, 7)['count'] == 11

If you have 17 floats with .0 then I'm pretty tempted to say you have an int more likely than not.

The general issue is, "what to do anytime content matches a higher guess_func when it could also be considered for a lower one"? The algorithm matches the first type found.

meza/meza/process.py

Lines 264 to 266 in 524b2fd

    
    def detect_types(records, min_conf=0.95, hweight=6, max_iter=100):
        """Detects record types by selecting the first type which reaches the
        minimum confidence level (based on number of hits).

meza/meza/typetools.py

Lines 160 to 169 in 524b2fd

    
    guess_funcs = [
        {"type": "null", "func": null_func},
        {"type": "bool", "func": ft.is_bool},
        {"type": "int", "func": int_func},
        {"type": "float", "func": float_func},
        {"type": "datetime", "func": is_datetime},
        {"type": "time", "func": is_time},
        {"type": "date", "func": is_date},
        {"type": "text", "func": lambda x: hasattr(x, "lower")},
    ]

E.g.,

        >>> records = it.cycle(
        ...    [
        ...        {"int": 1, "float": "1.0"},
        ...        {"int": 1, "float": "10.00"},
        ...        {"int": 0, "float": "0.0"}
        ...    ]
        ... )
        >>> detect_types(records)[1]['types']
        [{'id': 'int', 'type': 'bool'}, {'id': 'float', 'type': 'int'}]
        >>> records = it.cycle(
        ...    [
        ...        {"int": 1, "float": "1.0"},
        ...        {"int": 1, "float": "10.00"},
        ...        {"int": 0, "float": "0.0"},
        ...        {"int": 2, "float": "0.1"}
        ...    ]
        ... )
        >>> detect_types(records)[1]['types']
        [{'id': 'int', 'type': 'int'}, {'id': 'float', 'type': 'float'}]
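The first-match behaviour behind those two results can be sketched in a few lines. This is a deliberate simplification, not meza's code: it ignores the confidence and hit-weight logic and simply requires every value in a column to pass a check, and the predicates and the `first_match` helper are hypothetical stand-ins for the real guess functions.

```python
def is_bool(value):
    # Hypothetical check: 0/1 and true/false spellings count as bool
    return str(value).strip().lower() in {"0", "1", "true", "false"}


def is_int(value):
    # Lenient check: anything that parses to a whole number
    try:
        return float(value).is_integer()
    except (TypeError, ValueError):
        return False


def is_float(value):
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


# Ordered from most specific to most general, as in guess_funcs
GUESS_FUNCS = [
    ("bool", is_bool),
    ("int", is_int),
    ("float", is_float),
    ("text", lambda value: hasattr(value, "lower")),
]


def first_match(values):
    """Return the first type whose check accepts every value."""
    for type_name, func in GUESS_FUNCS:
        if all(func(value) for value in values):
            return type_name
    return None


print(first_match([1, 1, 0]))                      # bool
print(first_match([1, 1, 0, 2]))                   # int
print(first_match(["1.0", "10.00", "0.0"]))        # int
print(first_match(["1.0", "10.00", "0.0", "0.1"])) # float
```

A single value that fails an earlier check (the `2` or the `"0.1"`) is what pushes the column to the next type in the list.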

reubano · 2021-12-25T02:26:27Z

One potential solution is an upcast kwarg that detect_types accepts which will cause it to match the last type found instead of the first.

detect_types(records, upcast=True)[1]['types']
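As a rough illustration of what `upcast` might mean, the sketch below collects every type that accepts a whole column and returns either the first or the last one. It is an assumption-laden toy, not a proposed patch: `detect_column_type` and the two predicates are hypothetical, confidence handling is omitted, and only the int/float part of the chain is modelled.

```python
def is_int(value):
    # Lenient check: anything that parses to a whole number
    try:
        return float(value).is_integer()
    except (TypeError, ValueError):
        return False


def is_float(value):
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


# Only the numeric part of the chain, ordered specific to general
NUMERIC_GUESS_FUNCS = [("int", is_int), ("float", is_float)]


def detect_column_type(values, upcast=False):
    """Return the first matching type, or the last one when upcast=True."""
    matches = [
        type_name
        for type_name, func in NUMERIC_GUESS_FUNCS
        if all(func(value) for value in values)
    ]
    if not matches:
        return None
    return matches[-1] if upcast else matches[0]


column = ["1.0", "10.00", "0.0"]
print(detect_column_type(column))               # int
print(detect_column_type(column, upcast=True))  # float
```

One design question this leaves open: the full guess_funcs list ends with a text check that accepts any string, so a naive "last match" over the whole list would report text for string input. An actual implementation would presumably need to bound how far upcasting can go.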

SteadBytes changed the title ~~Floats with zero fractional component are detected as ints during type detection~~ Floats with zero fractional component are detected as ints by detect_types Aug 7, 2020

SteadBytes changed the title ~~Floats with zero fractional component are detected as ints by detect_types~~ Floats with zero fractional component are detected as ints by process.detect_types Aug 7, 2020

SteadBytes mentioned this issue Aug 7, 2020

Datetimes with all-zero time component (exact midnight) are detected as dates #35

Closed

reubano added enhancement help wanted labels Dec 25, 2021

reubano changed the title ~~Floats with zero fractional component are detected as ints by process.detect_types~~ Allow process.detect_types to match last type instead of the first Dec 25, 2021

reubano changed the title ~~Allow process.detect_types to match last type instead of the first~~ Allow `process.detect_types` to match last type instead of the first Dec 25, 2021
