Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add DataFrames as a weak dependency #3441

Closed
wants to merge 1 commit into from
Closed

Conversation

odow
Copy link
Member

@odow odow commented Aug 3, 2023

Closes #3438

I started working on this, but I don't know whether I like it or not.

I think I prefer the approach of explicitly constructing a data frame, and then adding a new column which is a vector of variables. It's more explicit and gives the user control over what the call the new column. This current approach would likely also be used naively to construct sparse sets which doesn't fix the loop problem.

@codecov
Copy link

codecov bot commented Aug 3, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (a325eb6) 98.01% compared to head (9164ea7) 98.01%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3441   +/-   ##
=======================================
  Coverage   98.01%   98.01%           
=======================================
  Files          36       38    +2     
  Lines        5039     5050   +11     
=======================================
+ Hits         4939     4950   +11     
  Misses        100      100           
Files Changed Coverage Δ
ext/JuMPDataFramesExt.jl 100.00% <100.00%> (ø)
ext/test_DataFrames.jl 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

function test_dimension_data_variable()
model = Model()
@variable(model, x[i = 2:4, j = 1:2], container = DataFrame)
@variable(model, y[i = 2:4, j = 1:2; isodd(i+j)], container = DataFrame)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about this, the less I like it. I prefer the explicit "add a new column to an existing dataframe" than the "construct a dataframe from these indices." The latter approach is just going to encourage the same nested loops as the GAMS example.

@blegat
Copy link
Member

blegat commented Aug 11, 2023

Ideally, we should parse the condition of the JuMP.Containers.NestedIterator then it would be an expression tree of conditions with && and ||. Then we put it into conjunctive normal form c1 && c2 && ... && cm. Then for each condition, we should check:

  1. which indices does it involve ?
  2. is this an equality lhs == rhs

If it involves a subset of the indices then we should filter with this condition earlier and not in the most nested level of the for loops.
If it is an equality, we could do the indexing ourself (as in https://discourse.julialang.org/t/is-it-possible-to-incorporate-the-relational-algebra-technique-into-jump-in-the-future-to-expedite-model-generation/101343/4?u=blegat) or use DataFrames but that's kind of an heavy dependency for just this.
Now what's a bit tricky is that you could have [i = I, j = J, k = K, l = L; f(i, j, k) == g(j, k, l)]. Then, it means you need to build the set of (i, j, k) and the set of (j, k, l) and then index them with the Dict to take the intersection. One could wonder if it wouldn't be better to just to the 4-nested for loops instead of having to do twice a 3-nested for loops. I think however than unless K has only 2 elements, it is always better to do two 3-nested for-loops than a 4-nested for loops so doing this intersection with dictionaries is always better.

In some cases, it might be better to do the nested for loops in an order that is different from the order of the indices used by the user but that can be left as future work since I don't think this is easy to find that. But detecting the two things I mentioned above and improving the iteration shouldn't be too complicated and it would solve the issue in the GAMS example without the user to do anything. The SparseAxisArray would automatically build itself efficiently.

@odow
Copy link
Member Author

odow commented Aug 14, 2023

detecting the two things I mentioned above and improving the iteration shouldn't be too complicated

This is getting a bit too magical. It also assumes that constructing the index sets and a dictionary solves the performance problem.

I don't think we need to "fix" the GAMS example. We need to encourage people to rethink how they view models. Nested for-loops are not the best way to conceptualize the IJKLM model.

@blegat
Copy link
Member

blegat commented Aug 14, 2023

This is getting a bit too magical. It also assumes that constructing the index sets and a dictionary solves the performance problem.

It makes the complexity depends linearly on the number of tuples satisfying the condition. It might be surprising to the user that we build the SparseAxisArray so naively while he gave us all the information in a macro that are necessary to take into account its sparsity.

Following this PR, how would the user solve the IJKLM issue ?

@odow
Copy link
Member Author

odow commented Aug 14, 2023

Following this PR, how would the user solve the IJKLM issue ?

They wouldn't. That's why I said I don't really like this PR.

@odow
Copy link
Member Author

odow commented Aug 14, 2023

Closing for now. But I'll leave #3438 open, and I'll add @blegat's comment to the issue. He has some CS things in mind.

@odow odow closed this Aug 14, 2023
@odow odow deleted the od/dataframes branch August 14, 2023 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Improve support for relational algebra
2 participants