Compatibility issue with numexpr engine and some eval calls #436

Open
niclaswue opened this issue May 15, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

@niclaswue
Contributor

niclaswue commented May 15, 2024

Hey Xavier,

I just discovered the library and really like it so far, great job! When tinkering a bit, I tried to work with the SCAT dataset and copied this test case into my notebook:

s = SCAT("scat20161015_20161021.zip", nflights=10)

To my surprise, I got this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 3
      1 from traffic.data.datasets.scat import SCAT
----> 3 s = SCAT("scat20161112_20161118.zip", nflights=10)

File ~/miniconda3/envs/traffic/lib/python3.10/site-packages/traffic/data/datasets/scat.py:121, in SCAT.__init__(self, ident, nflights)
    118 if "grib_meteo" in file_info.filename:
    119     continue
--> 121 entry = self.parse_zipinfo(zf, file_info)
    122 flights.append(entry.flight)
    123 flight_plans.append(entry.flight_plan)

File ~/miniconda3/envs/traffic/lib/python3.10/site-packages/traffic/data/datasets/scat.py:64, in SCAT.parse_zipinfo(self, zf, file_info)
     58 decoded = json.loads(content_bytes.decode())
     59 flight_id = str(decoded["id"])  # noqa: F841
     61 flight_plan = (
     62     pd.json_normalize(decoded["fpl"]["fpl_plan_update"])
     63     .rename(columns=rename_columns)
---> 64     .eval(
     65         """
     66     timestamp = @pd.to_datetime(timestamp, utc=True, format="mixed", errors="coerce")
     67     flight_id = @flight_id
     68     """
     69     )
     70 )
     72 clearance = (
     73     pd.json_normalize(decoded["fpl"]["fpl_clearance"])
     74     .rename(columns=rename_columns)
   (...)
     80     )
     81 )
     83 fpl_base, *_ = decoded["fpl"]["fpl_base"]

File ~/miniconda3/envs/traffic/lib/python3.10/site-packages/pandas/core/frame.py:4937, in DataFrame.eval(self, expr, inplace, **kwargs)
   4934     kwargs["target"] = self
   4935 kwargs["resolvers"] = tuple(kwargs.get("resolvers", ())) + resolvers
-> 4937 return _eval(expr, inplace=inplace, **kwargs)

File ~/miniconda3/envs/traffic/lib/python3.10/site-packages/pandas/core/computation/eval.py:357, in eval(expr, parser, engine, local_dict, global_dict, resolvers, level, target, inplace)
    355 eng = ENGINES[engine]
    356 eng_inst = eng(parsed_expr)
--> 357 ret = eng_inst.evaluate()
    359 if parsed_expr.assigner is None:
    360     if multi_line:

File ~/miniconda3/envs/traffic/lib/python3.10/site-packages/pandas/core/computation/engines.py:81, in AbstractEngine.evaluate(self)
     78     self.result_type, self.aligned_axes = align_terms(self.expr.terms)
     80 # make sure no names in resolvers and locals/globals clash
---> 81 res = self._evaluate()
     82 return reconstruct_object(
     83     self.result_type, res, self.aligned_axes, self.expr.terms.return_type
     84 )

File ~/miniconda3/envs/traffic/lib/python3.10/site-packages/pandas/core/computation/engines.py:121, in NumExprEngine._evaluate(self)
    119 scope = env.full_scope
    120 _check_ne_builtin_clash(self.expr)
--> 121 return ne.evaluate(s, local_dict=scope)

File ~/miniconda3/envs/traffic/lib/python3.10/site-packages/numexpr/necompiler.py:975, in evaluate(ex, local_dict, global_dict, out, order, casting, sanitize, _frame_depth, **kwargs)
    973     return re_evaluate(local_dict=local_dict, _frame_depth=_frame_depth)
    974 else:
--> 975     raise e

File ~/miniconda3/envs/traffic/lib/python3.10/site-packages/numexpr/necompiler.py:877, in validate(ex, local_dict, global_dict, out, order, casting, _frame_depth, sanitize, **kwargs)
    874 arguments = getArguments(names, local_dict, global_dict, _frame_depth=_frame_depth)
    876 # Create a signature
--> 877 signature = [(name, getType(arg)) for (name, arg) in
    878             zip(names, arguments)]
    880 # Look up numexpr if possible.
    881 numexpr_key = expr_key + (tuple(signature),)

File ~/miniconda3/envs/traffic/lib/python3.10/site-packages/numexpr/necompiler.py:877, in <listcomp>(.0)
    874 arguments = getArguments(names, local_dict, global_dict, _frame_depth=_frame_depth)
    876 # Create a signature
--> 877 signature = [(name, getType(arg)) for (name, arg) in
    878             zip(names, arguments)]
    880 # Look up numexpr if possible.
    881 numexpr_key = expr_key + (tuple(signature),)

File ~/miniconda3/envs/traffic/lib/python3.10/site-packages/numexpr/necompiler.py:717, in getType(a)
    715 if kind == 'U':
    716     raise ValueError('NumExpr 2 does not support Unicode as a dtype.')
--> 717 raise ValueError("unknown type %s" % a.dtype.name)

I tried again in a clean conda environment and the error disappeared, so it was probably caused by another package pulling in numexpr. When numexpr is installed, .eval() uses the "numexpr" engine by default (see: pandas docs); it only falls back to "python" when numexpr is not found. Some expressions do not seem to be supported by numexpr, so to avoid similar issues I would suggest explicitly setting engine="python" in every call where the numexpr engine fails (see also this discussion). This is what fixed the error for me:

Example from SCAT:

flight_plan = (
    pd.json_normalize(decoded["fpl"]["fpl_plan_update"])
    .rename(columns=rename_columns)
    .eval(
        """
    timestamp = @pd.to_datetime(timestamp, utc=True, format="mixed")
    flight_id = @flight_id
    """, engine="python"
    )
)

Alternatively, these cases could be written directly in plain Python. That may be more work to implement, but I think it is ultimately the best option: the code is more readable and errors are easier to debug. What do you think?
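
For the SCAT line above, the plain-Python variant could look roughly like this. It is only a sketch: `add_plan_columns` is a hypothetical helper name, the sample rows are made up, and `format="mixed"` assumes pandas 2.0 or later, as in the original expression. It keeps `errors="coerce"` from the library code shown in the traceback.

```python
import pandas as pd


def add_plan_columns(df: pd.DataFrame, flight_id: str) -> pd.DataFrame:
    """Hypothetical helper: the same two assignments as the .eval() call, without eval."""
    return df.assign(
        # Parse mixed-format timestamps as UTC; unparsable values become NaT.
        timestamp=lambda d: pd.to_datetime(
            d["timestamp"], utc=True, format="mixed", errors="coerce"
        ),
        # A scalar is broadcast to every row.
        flight_id=flight_id,
    )


# Made-up sample rows standing in for the normalized JSON.
raw = pd.DataFrame(
    {"timestamp": ["2016-10-15T10:00:00", "2016-10-15T10:00:01.500"]}
)
plan = add_plan_columns(raw, flight_id="100000")
print(plan.dtypes)
```

Since no expression string is parsed, the numexpr engine is never involved, and a failure would point at the `pd.to_datetime` call itself rather than at the eval machinery.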

@niclaswue niclaswue added the bug Something isn't working label May 15, 2024
@xoolive
Owner

xoolive commented May 15, 2024

I am not really sure; I have had issues in the past when setting the engine explicitly. It might be smarter in general (across the whole code base) to add numexpr as a dependency and adjust the queries where things fail. That may be easier to maintain...

For this specific line you are pointing at, I don't mind adding it. Happy to accept a pull request if you want to give it a shot. (Please run ruff on the file before committing, otherwise CI breaks...)

@niclaswue
Contributor Author

Yes, that would be a good idea as well. My understanding is that eval is only really useful with the numexpr engine anyway. The pandas documentation states:

'python' : Performs operations as if you had eval’d in top level python. This engine is generally not that useful.

Therefore, I would ultimately prefer plain Python code over engine="python".

Steps could be:

  1. Add numexpr, pinned to a specific version, as a dependency
  2. See what breaks
  3. Rewrite whatever breaks in plain Python code

Anyway, for the SCAT line above, I opened a PR.
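
Step 2 of the list ("see what breaks") could be partly automated by running each expression under both engines and collecting the failures. A rough sketch that assumes nothing about the traffic code base: `failing_engines` is a hypothetical helper and the sample frame is made up. It only handles expressions that refer to columns, not to local variables via `@`.

```python
import pandas as pd


def failing_engines(df: pd.DataFrame, expr: str) -> dict[str, str]:
    """Hypothetical helper: map each engine that fails on `expr` to its error."""
    failures = {}
    for engine in ("numexpr", "python"):
        try:
            df.eval(expr, engine=engine)
        except Exception as exc:  # also covers a missing numexpr install
            failures[engine] = f"{type(exc).__name__}: {exc}"
    return failures


# Made-up frame; a simple arithmetic assignment should work under either engine.
sample = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
print(failing_engines(sample, "c = a + b"))
```

An expression that shows up under "numexpr" but not under "python" would be a candidate for step 3.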

@xoolive
Owner

xoolive commented May 17, 2024

Yes, thank you. I will leave the issue open as a reminder and deal with it when I can.
