
Feature/expectation maximization #1424

Open · kyjohnso wants to merge 13 commits into dev

Conversation

kyjohnso

Your checklist for this pull request

Please review the guidelines for contributing to this repository.

  • Make sure you are requesting to pull a topic/feature/bugfix branch (right side). Don't request a pull from your master branch!
  • Make sure you are making a pull request against the dev branch (left side). You should also start your branch off our dev.
  • Check that the commit messages' style (ideally for all commits) matches our requested structure.

Issue number(s) that this pull request fixes

  • Fixes #

List of changes to the codebase in this pull request

@kyjohnso (Author) commented Jun 10, 2021

I would be happy to consolidate with #1420. I think this one is different because it allows any of the variables in each sample to be unobserved, rather than a single variable being "latent".

My implementation is a bit different in that it iterates over each CPD in series (one at a time), rather than updating all of them per iteration. I think this can be adjusted pretty simply.

codecov bot commented Jun 10, 2021

Codecov Report

Merging #1424 (49dce38) into dev (82911bf) will decrease coverage by 0.43%.
The diff coverage is 14.28%.


@@            Coverage Diff             @@
##              dev    #1424      +/-   ##
==========================================
- Coverage   93.93%   93.50%   -0.44%     
==========================================
  Files         134      135       +1     
  Lines       14150    14227      +77     
==========================================
+ Hits        13292    13303      +11     
- Misses        858      924      +66     
Impacted Files                  Coverage Δ
pgmpy/estimators/EM.py          13.15% <13.15%> (ø)
pgmpy/estimators/__init__.py    100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ankurankan (Member)

@kyjohnso Thanks for the PR. Yes, I think it would be a great idea to merge our two implementations. I had a quick look at your code and I have a few comments. Please correct me if I am misunderstanding anything.

  1. I like that your implementation works for any missing value instead of just specified latents; it's a nice feature to have.
  2. You are using variable elimination to compute the weights for the samples, which would be computationally very expensive. In my implementation I used a more efficient way to compute it: since we have all the variables as evidence, we can just select the values of those states from the CPDs and multiply them. This is equivalent to running inference and is computationally much cheaper (see the sketch after this comment).
  3. I think rather than asking the user to start with a prior distribution, we should just generate random distributions for the starting values.

In my opinion, the best way to move forward would be to use my implementation (as I have some optimizations implemented: inference and value caching) and add the extra features from your implementation on top of that. What do you think?
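
For reference, here is a minimal sketch of the CPD-product idea from point 2 (not code from either PR; the toy model and the helper sample_likelihood are only illustrative). For a fully specified sample, P(x_1, ..., x_n) = prod_i P(x_i | parents(x_i)), so the weight can be read straight off the CPD tables:

from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD

# Toy two-node model, for illustration only.
model = BayesianModel([("A", "B")])
model.add_cpds(
    TabularCPD("A", 2, [[0.4], [0.6]]),
    TabularCPD("B", 2, [[0.3, 0.8], [0.7, 0.2]], evidence=["A"], evidence_card=[2]),
)

def sample_likelihood(model, sample):
    # `sample` maps every variable to a state index, e.g. {"A": 1, "B": 0}.
    likelihood = 1.0
    for cpd in model.get_cpds():
        # cpd.values has one axis per variable in cpd.variables (the variable
        # itself first, then its evidence), so indexing each axis with the
        # sample's state picks out the single required probability.
        likelihood *= cpd.values[tuple(sample[var] for var in cpd.variables)]
    return likelihood

print(sample_likelihood(model, {"A": 1, "B": 0}))  # P(A=1) * P(B=0 | A=1) = 0.6 * 0.8 = 0.48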

@kyjohnso (Author)

> @kyjohnso Thanks for the PR. Yes, I think it would be a great idea to merge our two implementations. I had a quick look at your code and I have a few comments. Please correct me if I am misunderstanding anything.

Hi @ankurankan, yes, I agree that your approach is the best way forward. Here are my thoughts:

> 1. I like that your implementation works for any missing value instead of just specified latents; it's a nice feature to have.

Yes, I really like having the option of multiple and different unobserved variables from sample to sample.

> 2. You are using variable elimination to compute the weights for the samples, which would be computationally very expensive. In my implementation I used a more efficient way to compute it: since we have all the variables as evidence, we can just select the values of those states from the CPDs and multiply them. This is equivalent to running inference and is computationally much cheaper.

I hadn't really thought about performance yet, and I agree that variable elimination will be much more expensive than your approach of multiplying the CPDs. I would want to make sure that each data sample can have multiple unobserved variables, which I think can still be implemented with the CPD product that you use.

> 3. I think rather than asking the user to start with a prior distribution, we should just generate random distributions for the starting values.

I think having the option of specifying a prior distribution is useful; I have an application where we are updating priors specified by domain experts, and I then want to refine those using EM. But yes, if CPDs aren't passed in with the model we could just generate random or uniform distributions (a small sketch of that initialization follows this comment).

> In my opinion, the best way to move forward would be to use my implementation (as I have some optimizations implemented: inference and value caching) and add the extra features from your implementation on top of that. What do you think?

Yes, let's move forward with your implementation and then add the sample-by-sample latent variables. Do you want me to wait until your pull request is approved, and then I can help update it? I can also help with unit tests and example ipynb's, or certainly with the core EM work as well.

Thanks for the quick response, and I look forward to working with you.
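
For reference, a minimal sketch of the initialization discussed under point 3 (not code from either PR; initial_cpds and cards are only illustrative names): keep any CPDs already attached to the model as the prior starting point, and generate random starting distributions for the rest.

import numpy as np
from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD

def initial_cpds(model, cards):
    # `cards` maps each variable to its cardinality, e.g. {"A": 2, "B": 2}.
    cpds = []
    for var in model.nodes():
        existing = model.get_cpds(var) if model.get_cpds() else None
        if existing is not None:
            cpds.append(existing)  # keep the user-supplied prior
            continue
        evidence = list(model.get_parents(var))
        ev_card = [cards[ev] for ev in evidence]
        n_cols = int(np.prod(ev_card)) if ev_card else 1
        values = np.random.rand(cards[var], n_cols)
        values = values / values.sum(axis=0)  # normalise each column into a distribution
        cpds.append(
            TabularCPD(var, cards[var], values,
                       evidence=evidence or None, evidence_card=ev_card or None)
        )
    return cpds

# Example: a model with no CPDs attached gets random starting distributions.
toy = BayesianModel([("A", "B")])
print(initial_cpds(toy, {"A": 2, "B": 2}))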

@ankurankan (Member)

@kyjohnso I think we are on the same page about the implementation then. I agree with your idea of having the option of specifying a prior distribution as well, falling back to random distributions otherwise. If you would like, could you maybe have a look at my PR? There's some bug that I am not able to figure out: in my implementation, the values always converge after the 2nd iteration, which I don't think is correct. To make it easier to follow, I will briefly explain the steps of my implementation (a rough sketch of steps 2-3 follows the list):

  1. Start by generating random CPDs (get_parameters method).
  2. In each iteration, for each sample (datapoint / row), create extra samples with each possible state of the missing variable and compute a weight for each. I have implemented caching here since many of the samples are identical, so their weights can be reused (_compute_weights method; _get_likelihood computes the weight for each sample and implements the CPD multiplication logic).
  3. Use MaximumLikelihoodEstimator (ML) to estimate the parameters from this new "expanded" dataset, but using the weights instead of giving every sample equal weight.
  4. Check convergence and repeat.
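
For reference, a rough, self-contained sketch of steps 2-3 (not the PR code): each row is expanded with every joint assignment of its missing variables and given a normalised weight, after which the weighted dataset can be handed to the maximum-likelihood step. The weight_of callable is a stand-in for a likelihood function such as the CPD product sketched earlier, and the caching of repeated rows is omitted for brevity.

import itertools
import pandas as pd

def expand_with_weights(data, latents, cards, weight_of):
    # For every row, add one copy per joint assignment of the latent variables
    # and attach the normalised weight of that completed row.
    rows = []
    for _, row in data.iterrows():
        completions = []
        for states in itertools.product(*(range(cards[l]) for l in latents)):
            full = row.to_dict()
            full.update(dict(zip(latents, states)))
            completions.append((full, weight_of(full)))
        total = sum(w for _, w in completions) or 1.0
        for full, w in completions:
            rows.append({**full, "_weight": w / total})
    return pd.DataFrame(rows)

# Example: one observed variable B, one latent variable A, and dummy (uniform) weights.
toy_data = pd.DataFrame({"B": [0, 1]})
print(expand_with_weights(toy_data, ["A"], {"A": 2}, weight_of=lambda s: 1.0))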

Also, I am using the following two models for testing the implementation:

from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD
from pgmpy.estimators import ExpectationMaximization as EM
from pgmpy.sampling import BayesianModelSampling
from pgmpy.utils import get_example_model

# Example model 1
# 
# model = BayesianModel([("A", "C"), ("B", "C"), ("C", "D"), ("A", "D")], latents=["A"])
# cpd_a = TabularCPD("A", 2, [[0.4], [0.6]])
# cpd_b = TabularCPD("B", 2, [[0.3], [0.7]])
# cpd_c = TabularCPD(
#     "C",
#     2,
#     [[0.2, 0.3, 0.5, 0.7], [0.8, 0.7, 0.5, 0.3]],
#     evidence=["A", "B"],
#     evidence_card=[2, 2],
# )
# cpd_d = TabularCPD(
#     "D",
#     2,
#     [[0.3, 0.8, 0.4, 0.6], [0.7, 0.2, 0.6, 0.4]],
#     evidence=["A", "C"],
#     evidence_card=[2, 2],
# )
# model.add_cpds(cpd_a, cpd_b, cpd_c, cpd_d)
#
# s = BayesianModelSampling(model)
# data = s.forward_sample(10000)
#
# m = BayesianModel([("A", "C"), ("B", "C"), ("C", "D"), ("A", "D")], latents=["A"])
# est = EM(m, data)
# cpds = est.get_parameters()

# Example model 2
model = get_example_model("alarm")
s = BayesianModelSampling(model)
data = s.forward_sample(10000)

# Relearn the parameters of the same structure, treating four variables as latent.
m = BayesianModel(model.edges(), latents=["SAO2", "HR", "HRBP", "EXPCO2"])
est = EM(m, data)
cpds = est.get_parameters(latent_card={"SAO2": 3, "HR": 3, "HRBP": 3, "EXPCO2": 4})

@kyjohnso (Author)

@ankurankan Yep, I am happy to look at your PR; I will probably have some time in the next couple of days. If I find something, or figure out how to add the features we discussed, would you prefer that I fork your fork, or would you rather incorporate my code another way?

@ankurankan (Member)

@kyjohnso Thanks a lot. I have pushed a new branch feature/em with my code to this repo now: https://github.com/pgmpy/pgmpy/tree/feature/em. I think we can both work on this branch (you can open PRs with your changes to this branch and I will also push my future changes to it) and once we are done we can merge it back to dev.

@ankurankan (Member)

@kyjohnso Hi, I don't know if you got a chance to look at my implementation yet, but I was going through it again and it seems like it was working fine all along. For fully observed datasets, the learned parameters are quite accurate. With latent variables it's not very accurate, but I think your implementation also gives similar results, if I am not mistaken? I was just expecting it to be more accurate in the latent-variable case. I have pushed my latest code; if you would like, you could implement your features on top of that? I will work on optimizing the implementation in the meanwhile.

@ankurankan (Member)

Ignore my last comment, I finally found the bug. It should be working as expected now.

@kyjohnso (Author)

Hi @ankurankan, yes, I have been taking a look at your implementation and got a quick start by running the example model you sent in the comments. I will pull the new code and start implementing my features on top of that.

@kyjohnso (Author)

@ankurankan are you still using the models that you posted above to test your implementation?

@ankurankan (Member)

@kyjohnso Yes, with an extra atol=0.001 argument, so that it doesn't run for the maximum number of iterations.
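
In terms of the test script above, and assuming atol is passed directly to get_parameters (as the comment indicates), that call would look like:

cpds = est.get_parameters(
    latent_card={"SAO2": 3, "HR": 3, "HRBP": 3, "EXPCO2": 4}, atol=0.001
)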
