
Feature/expectation maximization #1424

Open · kyjohnso wants to merge 13 commits into dev

Conversation

kyjohnso

Your checklist for this pull request

Please review the guidelines for contributing to this repository.

  • Make sure you are requesting to pull a topic/feature/bugfix branch (right side). Don't request a pull from your master branch!
  • Make sure you are making a pull request against the dev branch (left side). You should also start your branch off our dev.
  • Check that the commit messages' style (ideally for all commits) matches our requested structure.

Issue number(s) that this pull request fixes

  • Fixes #

List of changes to the codebase in this pull request

@kyjohnso (Author) commented Jun 10, 2021

I would be happy to consolidate with #1420. I think this one is different because it allows any of the variables in each sample to be unobserved, rather than a single variable being "latent".

My implementation is a bit different in that it iterates over each CPD in series (one at a time), rather than updating all of them per iteration. I think this can be adjusted pretty simply.

codecov bot commented Jun 10, 2021

Codecov Report

Merging #1424 (49dce38) into dev (82911bf) will decrease coverage by 0.43%.
The diff coverage is 14.28%.


@@            Coverage Diff             @@
##              dev    #1424      +/-   ##
==========================================
- Coverage   93.93%   93.50%   -0.44%     
==========================================
  Files         134      135       +1     
  Lines       14150    14227      +77     
==========================================
+ Hits        13292    13303      +11     
- Misses        858      924      +66     
Impacted Files                  Coverage Δ
pgmpy/estimators/EM.py          13.15% <13.15%> (ø)
pgmpy/estimators/__init__.py    100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ankurankan (Member)

@kyjohnso Thanks for the PR. Yes, I think it would be a great idea to merge our two implementations. I had a quick look at your code and I have a few comments. Please correct me if I am misunderstanding anything.

  1. I like that your implementation works for any missing value instead of just specified latents; it's a nice feature to have.
  2. You are using variable elimination to compute the weights for the samples, which would be computationally very expensive. In my implementation I used a more efficient way to compute it: since we have all the variables as evidence, we can just select the values of those states from the CPDs and multiply them. This is equivalent to running inference and is computationally much cheaper (see the sketch after this comment).
  3. I think rather than asking the user to start with a prior distribution, we should just generate random distributions for the starting values.

In my opinion, the best way to move forward would be to use my implementation (as I have some optimizations implemented: inference and value caching) and add the extra features from your implementation on top of that. What do you think?
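
For reference, here is a minimal sketch of the CPD-product idea from point 2 (not code from either PR; the toy model and the helper sample_likelihood are only illustrative). For a fully specified sample, P(x_1, ..., x_n) = prod_i P(x_i | parents(x_i)), so the weight can be read straight off the CPD tables:

from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD

# Toy two-node model, for illustration only.
model = BayesianModel([("A", "B")])
model.add_cpds(
    TabularCPD("A", 2, [[0.4], [0.6]]),
    TabularCPD("B", 2, [[0.3, 0.8], [0.7, 0.2]], evidence=["A"], evidence_card=[2]),
)

def sample_likelihood(model, sample):
    # `sample` maps every variable to a state index, e.g. {"A": 1, "B": 0}.
    likelihood = 1.0
    for cpd in model.get_cpds():
        # cpd.values has one axis per variable in cpd.variables (the variable
        # itself first, then its evidence), so indexing each axis with the
        # sample's state picks out the single required probability.
        likelihood *= cpd.values[tuple(sample[var] for var in cpd.variables)]
    return likelihood

print(sample_likelihood(model, {"A": 1, "B": 0}))  # P(A=1) * P(B=0 | A=1) = 0.6 * 0.8 = 0.48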

@kyjohnso (Author)

> @kyjohnso Thanks for the PR. Yes, I think it would be a great idea to merge our two implementations. I had a quick look at your code and I have a few comments. Please correct me if I am misunderstanding anything.

Hi @ankurankan, yes, I agree that your approach is the best way forward. Here are my thoughts:

> 1. I like that your implementation works for any missing value instead of just specified latents; it's a nice feature to have.

Yes, I really like having the option of multiple and different unobserved variables from sample to sample.

> 2. You are using variable elimination to compute the weights for the samples, which would be computationally very expensive. In my implementation I used a more efficient way to compute it: since we have all the variables as evidence, we can just select the values of those states from the CPDs and multiply them. This is equivalent to running inference and is computationally much cheaper.

I hadn't really thought about performance yet, and I agree that variable elimination will be much more expensive than your approach of multiplying the CPDs. I would want to make sure that each data sample can have multiple unobserved variables, which I think can still be implemented with the CPD product that you use.

> 3. I think rather than asking the user to start with a prior distribution, we should just generate random distributions for the starting values.

I think having the option of specifying a prior distribution is useful; I have an application where we are updating priors specified by domain experts, and I then want to refine those using EM. But yes, if CPDs aren't passed in with the model we could just generate random or uniform distributions (a small sketch of that initialization follows this comment).

> In my opinion, the best way to move forward would be to use my implementation (as I have some optimizations implemented: inference and value caching) and add the extra features from your implementation on top of that. What do you think?

Yes, let's move forward with your implementation and then add the sample-by-sample latent variables. Do you want me to wait until your pull request is approved, and then I can help update it? I can also help with unit tests and example ipynb's, or certainly with the core EM work as well.

Thanks for the quick response, and I look forward to working with you.
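
For reference, a minimal sketch of the initialization discussed under point 3 (not code from either PR; initial_cpds and cards are only illustrative names): keep any CPDs already attached to the model as the prior starting point, and generate random starting distributions for the rest.

import numpy as np
from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD

def initial_cpds(model, cards):
    # `cards` maps each variable to its cardinality, e.g. {"A": 2, "B": 2}.
    cpds = []
    for var in model.nodes():
        existing = model.get_cpds(var) if model.get_cpds() else None
        if existing is not None:
            cpds.append(existing)  # keep the user-supplied prior
            continue
        evidence = list(model.get_parents(var))
        ev_card = [cards[ev] for ev in evidence]
        n_cols = int(np.prod(ev_card)) if ev_card else 1
        values = np.random.rand(cards[var], n_cols)
        values = values / values.sum(axis=0)  # normalise each column into a distribution
        cpds.append(
            TabularCPD(var, cards[var], values,
                       evidence=evidence or None, evidence_card=ev_card or None)
        )
    return cpds

# Example: a model with no CPDs attached gets random starting distributions.
toy = BayesianModel([("A", "B")])
print(initial_cpds(toy, {"A": 2, "B": 2}))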

@ankurankan (Member)

@kyjohnso I think we are on the same page about the implementation then. I agree with your idea of having the option of specifying a prior distribution as well, falling back to random distributions otherwise. If you would like, could you maybe have a look at my PR? There's some bug that I am not able to figure out: in my implementation, the values always converge after the 2nd iteration, which I don't think is correct. To make it easier to follow, I will briefly explain the steps of my implementation (a rough sketch of steps 2-3 follows the list):

  1. Start by generating random CPDs (get_parameters method).
  2. In each iteration, for each sample (datapoint / row), create extra samples with each possible state of the missing variable and compute a weight for each. I have implemented caching here since many of the samples are identical, so their weights can be reused (_compute_weights method; _get_likelihood computes the weight for each sample and implements the CPD multiplication logic).
  3. Use MaximumLikelihoodEstimator (ML) to estimate the parameters from this new "expanded" dataset, but using the weights instead of giving every sample equal weight.
  4. Check convergence and repeat.
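
For reference, a rough, self-contained sketch of steps 2-3 (not the PR code): each row is expanded with every joint assignment of its missing variables and given a normalised weight, after which the weighted dataset can be handed to the maximum-likelihood step. The weight_of callable is a stand-in for a likelihood function such as the CPD product sketched earlier, and the caching of repeated rows is omitted for brevity.

import itertools
import pandas as pd

def expand_with_weights(data, latents, cards, weight_of):
    # For every row, add one copy per joint assignment of the latent variables
    # and attach the normalised weight of that completed row.
    rows = []
    for _, row in data.iterrows():
        completions = []
        for states in itertools.product(*(range(cards[l]) for l in latents)):
            full = row.to_dict()
            full.update(dict(zip(latents, states)))
            completions.append((full, weight_of(full)))
        total = sum(w for _, w in completions) or 1.0
        for full, w in completions:
            rows.append({**full, "_weight": w / total})
    return pd.DataFrame(rows)

# Example: one observed variable B, one latent variable A, and dummy (uniform) weights.
toy_data = pd.DataFrame({"B": [0, 1]})
print(expand_with_weights(toy_data, ["A"], {"A": 2}, weight_of=lambda s: 1.0))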

Also, I am using the following two models for testing the implementation:

from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD
from pgmpy.estimators import ExpectationMaximization as EM
from pgmpy.sampling import BayesianModelSampling
from pgmpy.utils import get_example_model

# Example model 1
# 
# model = BayesianModel([("A", "C"), ("B", "C"), ("C", "D"), ("A", "D")], latents=["A"])
# cpd_a = TabularCPD("A", 2, [[0.4], [0.6]])
# cpd_b = TabularCPD("B", 2, [[0.3], [0.7]])
# cpd_c = TabularCPD(
#     "C",
#     2,
#     [[0.2, 0.3, 0.5, 0.7], [0.8, 0.7, 0.5, 0.3]],
#     evidence=["A", "B"],
#     evidence_card=[2, 2],
# )
# cpd_d = TabularCPD(
#     "D",
#     2,
#     [[0.3, 0.8, 0.4, 0.6], [0.7, 0.2, 0.6, 0.4]],
#     evidence=["A", "C"],
#     evidence_card=[2, 2],
# )
# model.add_cpds(cpd_a, cpd_b, cpd_c, cpd_d)
#
# s = BayesianModelSampling(model)
# data = s.forward_sample(10000)
#
# m = BayesianModel([("A", "C"), ("B", "C"), ("C", "D"), ("A", "D")], latents=["A"])
# est = EM(m, data)
# cpds = est.get_parameters()

# Example model 2
model = get_example_model("alarm")
s = BayesianModelSampling(model)
data = s.forward_sample(10000)

# Relearn the parameters of the same structure, treating four variables as latent.
m = BayesianModel(model.edges(), latents=["SAO2", "HR", "HRBP", "EXPCO2"])
est = EM(m, data)
cpds = est.get_parameters(latent_card={"SAO2": 3, "HR": 3, "HRBP": 3, "EXPCO2": 4})

@kyjohnso (Author)

@ankurankan Yep, I am happy to look at your PR; I will probably have some time in the next couple of days. If I find something, or figure out how to add the features we discussed, would you prefer that I fork your fork, or would you rather incorporate my code another way?

@ankurankan (Member)

@kyjohnso Thanks a lot. I have pushed a new branch feature/em with my code to this repo now: https://github.com/pgmpy/pgmpy/tree/feature/em. I think we can both work on this branch (you can open PRs with your changes to this branch and I will also push my future changes to it) and once we are done we can merge it back to dev.

@ankurankan (Member)

@kyjohnso Hi, I don't know if you got a chance to look at my implementation yet, but I was going through it again and it seems like it was working fine all along. For fully observed datasets, the learned parameters are quite accurate. With latent variables it's not very accurate, but I think your implementation also gives similar results, if I am not mistaken? I was just expecting it to be more accurate in the latent-variable case. I have pushed my latest code; if you would like, you could implement your features on top of that? I will work on optimizing the implementation in the meanwhile.

@ankurankan (Member)

Ignore my last comment, I finally found the bug. It should be working as expected now.

@kyjohnso (Author)

Hi @ankurankan, yes, I have been taking a look at your implementation and got a quick start by running the example model you sent in the comments. I will pull the new code and start implementing my features on top of that.

@kyjohnso (Author)

@ankurankan are you still using the models that you posted above to test your implementation?

@ankurankan (Member)

@kyjohnso Yes, with an extra atol=0.001 argument, so that it doesn't run for the maximum number of iterations.
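
In terms of the test script above, and assuming atol is passed directly to get_parameters (as the comment indicates), that call would look like:

cpds = est.get_parameters(
    latent_card={"SAO2": 3, "HR": 3, "HRBP": 3, "EXPCO2": 4}, atol=0.001
)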
