CPU spike fault not injecting #82

Open
jv-frechstest opened this issue Oct 11, 2021 · 14 comments

@jv-frechstest

Describe the bug
Mangle is deployed in an OpenShift container. We are trying to inject a CPU spike fault into a K8s cluster endpoint service, but we are not able to spike the CPU to 90%. The CPU spikes very little compared to the target we set.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the 'Mangle UI'.
  2. Click on 'Fault Exception'.
  3. Click on 'CPU'.
  4. Fill in the target cluster and JVM information, and set CPU to 100%.
  5. Go to Requests and Response; you will see that the request executed successfully.

Expected behavior
CPU spiked to 100% on the target instance.


Configuration information:

  • Deployment Type: Container (target API on a K8s container)
  • Deployment Mode: Cluster
  • Client OS: OpenShift
  • Client Browser: Chrome
  • Version: 93


@ashrimalivmware
Contributor

How much of a CPU spike do you see after requesting 100%? How are you measuring the spike?

@ashrimalivmware
Contributor

> How much of a CPU spike do you see after requesting 100%? How are you measuring the spike?

Hi @jv-frechstest, can you please provide more details?

@ashrimalivmware self-assigned this Dec 17, 2021
@ashrimalivmware
Contributor

Since there has been no update, I am closing the ticket. @jv-frechstest, please re-open if required.

@Anvesh42

Hi. Our team has adopted the Mangle product in our organization. It runs as a Docker image on a Kubernetes cluster. We are having similar issues with CPU_SPIKE and MEMORY_SPIKE: when we inject 100% chaos for either the CPU or the memory spike, we only see about a 60% injection.
We have tried reaching out to Mangle support at mangle@vmware.com a few times, but we have not seen any response.
Is this something that the Mangle team can help us with?

@rpraveen-vmware
Contributor

Hi @Anvesh42,
As mentioned in the earlier comment, can you please specify how you are measuring the spike?
From the details provided above, you are performing the application-level CPU/memory fault, right?
Do you observe a similar issue when you perform the infrastructure-level faults?

@Anvesh42

@rpraveen-vmware Thanks for the response. Let me sum up the challenges/issues our team is facing with Mangle before getting into the execution details and the further scenarios that were run.

  1. Application-level CPU & MEMORY faults
     At the application level, these faults expect a JVM process ID as a mandatory input. We have a Kubernetes environment where a pod of a deployment can land on any node in the cluster, so if there are 5 pods of a given microservice, the JVM PID can differ per pod. In our case, 3 pods had one JVM PID, let's say 15, and the other 2 had, let's say, 13. All 5 pods belong to the same microservice.
     So if we were to inject the CPU spike fault on all the pods of the microservice foo using the label app=foo, how do we overcome this obstacle? Based on the above scenario, we can only choose 13 or 15 as the JVM process.

     How do we address the injection of the CPU fault at the application level in this scenario? Our approach with all the faults has been to use labels instead of a specific container, so that all the pods matching that label are in play.

  2. Infrastructure-level CPU & MEMORY faults
     In contrast to the application-level CPU fault, the infrastructure-level CPU fault does not expect the JVM ID and a few other arguments, but it acts at the infrastructure level, impacting all processes rather than just the microservice-specific process. I am not clear on whether infrastructure-level faults can be used in place of application-level faults for CPU & MEMORY. I would like to hear your suggestions on this.

     As I understand it, the spike is injected at the infra level with the infrastructure faults and at a specific JVM process with the application fault. The application-level CPU & MEMORY faults appear to be more specific to a particular JVM process, which is good: if there are, let's say, 5 different services running on a given node within the cluster, running an infrastructure fault will impact all of them.

  3. Measuring CPU & MEMORY faults
     In response to your question, we are using a Grafana dashboard to measure the CPU spike. However, we also use the top command to observe the spike in a given container.

Your response is appreciated. Thanks!

@rpraveen-vmware
Contributor

@Anvesh42

  1. For the application-level CPU and memory faults:
     The mandatory JVM process parameter can be either the process ID or the JVM process descriptor name.
     In the case of multiple pods, since the process IDs will differ, you can go for the second option.
     Use the "jps" command to list the Java processes with their descriptor names, e.g.:

       jps
       1296 Jps
       1 LintApplication

     This name (e.g. LintApplication) will remain the same for this Java process across pods (see the sketch after this list for checking it across all pods of a label).

  2. It depends on your use case and the specific testing.
     The application-level CPU/memory fault targets the specific JVM process. Hence, it simulates your running Java process causing the CPU/heap-memory spikes.
     In the case of the infrastructure CPU/memory fault, the resource usage of the machine is increased as a whole, so you can test how your application hosted on that machine behaves when the resource spike happens (a simulation of spikes caused by external factors on the machine).

  3. You can use

       kubectl top pod POD_NAME --containers   # Show metrics for a given pod and its containers
       kubectl top pod POD_NAME --sort-by=cpu  # Show metrics for a given pod and sort by 'cpu' or 'memory'

     when monitoring resources for pods.
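As a minimal sketch of checking the descriptor name across every pod of a label (an illustration, not a Mangle feature; the namespace my-namespace and label app=foo are placeholders, and it assumes jps is available in the container images):

  # Print the JVM descriptor name reported by jps in each pod matching app=foo
  for pod in $(kubectl get pods -n my-namespace -l app=foo -o jsonpath='{.items[*].metadata.name}'); do
    printf '%s: ' "$pod"
    kubectl exec -n my-namespace "$pod" -- jps | grep -v ' Jps$'
  done

If every replica prints the same descriptor name (e.g. LintApplication), that name can be supplied as the JVM process input instead of a PID.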

@Anvesh42

@rpraveen-vmware Thanks for your inputs, Praveen. I am working on running these scenarios based on the above pointers.

Meanwhile, we are looking to upgrade Mangle from 3.0 to 3.5. I have been told that version 3.5 includes the Log4j vulnerability remediation (an issue that surfaced very recently, a few weeks ago).

https://hub.docker.com/layers/mangleuser/mangle/3.5.0/images/sha256-cc8d7c4542a86a942c046e118602db093efa7d7ba529f61845d761a75c1b6f9c?context=explore

I did find the image, but I do not see any information on the changelog, i.e., what has changed from 3.0 to 3.5. I am not sure whether the Log4j remediation changes have been included in this version.

Would you mind shedding some light on this? Or is there some other place where I could find the changelog?

Thanks
Anvesh

@ashrimalivmware
Contributor

Hi @Anvesh42 ,

Mangle 3.5 has the following changes:

  1. Integration of Dynatrace as a metric provider in Mangle
  2. Enhanced network faults with varied latency over the entire timeout
  3. Option to list all the resources of a K8s cluster and select the required resource for fault injection
  4. A new fault for draining K8s nodes
  5. Log4j vulnerability fix
  6. Much improved real-time polling

Thanks,
-Avinash

@Anvesh42

Anvesh42 commented Feb 28, 2022

@ashrimalivmware @rpraveen-vmware Thanks, Avinash.

I ran the tests on the current version of Mangle based on @rpraveen-vmware's suggestion to use the jps descriptor name instead of the PID. While it did help to some extent and addressed that concern, here are the findings:

  1. I had 4 pods of a microservice running in a namespace and injected the CPU_SPIKE chaos at the application level using the jps argument.
     • The CPU fault injected the spike on only 3 of the 4 pods.
     • Why did the 4th pod miss out from this execution? It has the same label and the same jps value (all 4 are replicas).
  2. The injected percentage is still less than the user-defined value. This was the primary issue of this thread.
     • The defined value was 80%, but the injected value was 50%. Please find the attached snippet of the configuration.
     • I re-ran the execution with 60% and 70% and still see only a 50% injection.
     • The spike was measured using the top command in the container.
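For reference, a sketch of how the same reading can be taken non-interactively from outside the pod (POD_NAME is a placeholder, and it assumes top is available in the image; this is standard kubectl/top usage, not a Mangle command):

  # One-shot CPU/memory reading per container in the pod
  kubectl top pod POD_NAME --containers

  # Or run top once in batch mode inside the container
  kubectl exec POD_NAME -- top -b -n 1 | head -n 15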

[Screenshot attached: cpu_spike]

It would be nice to connect so we can work together to address/improve these issues.

P.S. We use Microsoft Teams in our environment, so we can connect there depending on your availability.

Appreciate your response!

Anvesh

@ashrimalivmware
Contributor

@Anvesh42 Yes, it would be better if we connect; MS Teams is fine with us. Please feel free to schedule a call. Preferable timings would be after 8:30 AM IST and before 9:30 PM IST.

@Anvesh42

Anvesh42 commented Mar 15, 2022

@ashrimalivmware @rpraveen-vmware Please find the attached OpenShift DC objects for the sample namespaces, DEV03 & DEV70, that we used during the working session to test the CPU spike fault scenarios.

DEV03 image properties:- RHEL:7.7-openjdk:1.8.0.232

DEV70 image properties:- RHEL:7.9-openjdk:1.8.0.292

We tested the following scenarios by modifying the resources section in each DC (DeploymentConfig) object; a sketch of applying such a change with oc set resources follows the scenario list.

NOTE:

  • Scenarios 1 & 2 are structurally identical, i.e., the request is less than the limit, but with different values (millicores vs. cores)
  • Scenarios 3 & 4 are structurally identical, i.e., the request is equal to the limit, but with different values (millicores vs. cores)
  1. CPU request is less than the CPU limit
     - resources:
         limits:
           cpu: '500m'
           memory: 2Gi
         requests:
           cpu: '100m'
           memory: 512Mi
  2. CPU request is less than the CPU limit
     - resources:
         limits:
           cpu: '1'
           memory: 2Gi
         requests:
           cpu: '200m'
           memory: 512Mi
  3. CPU request is equal to the CPU limit (request & limit are both 1 core)
     - resources:
         limits:
           cpu: '1'
           memory: 2Gi
         requests:
           cpu: '1'
           memory: 512Mi
  4. CPU request is equal to the CPU limit (request & limit are both 500 millicores)
     - resources:
         limits:
           cpu: '500m'
           memory: 2Gi
         requests:
           cpu: '500m'
           memory: 512Mi
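As an illustration (not from the original test run), the same resources change can be applied with oc set resources; DC_NAME and NAMESPACE are placeholders, and the values shown correspond to scenario 2:

  # Switch the DC to a 1-core limit / 200m request (scenario 2); DC name and namespace are hypothetical
  oc set resources dc/DC_NAME -n NAMESPACE --limits=cpu=1,memory=2Gi --requests=cpu=200m,memory=512Mi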

Command used to measure the CPU spike in the microservice container: kubectl top pod <POD_NAME> --containers
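To watch the reading while a fault is running, the same command can be sampled periodically, e.g. (a sketch; POD_NAME is a placeholder):

  # Refresh the per-container metrics every 5 seconds during the injection
  watch -n 5 kubectl top pod POD_NAME --containers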

Observations made:

  1. With the configuration depicted in scenario 1, the user injected a CPU spike of 80% on both the DEV70 and DEV03 pods.
     • DEV03: the injected spike was always less than the user-defined value; in most cases the spike did not cross 50%.
     • DEV70: the injected spike was always less than the user-defined value; in most cases the spike did not cross 50%.
  2. With the configuration depicted in scenario 2, the user injected a CPU spike of 80% on both the DEV70 and DEV03 pods.
     • DEV03: the injected spike was always equal to the user-defined value. Successful scenario.
     • DEV70: the injected spike was always equal to the user-defined value. Successful scenario.
  3. With the configuration depicted in scenario 3, the user injected a CPU spike of 80% on both the DEV70 and DEV03 pods.
     • DEV03: the injected spike was always equal to the user-defined value. Successful scenario.
     • DEV70: the injected spike was always equal to the user-defined value. Successful scenario.
  4. With the configuration depicted in scenario 4, the user injected a CPU spike of 80% on both the DEV70 and DEV03 pods.
     • DEV03: the injected spike was always less than the user-defined value; in most cases the spike did not cross 50%.
     • DEV70: the injected spike was always less than the user-defined value; in most cases the spike did not cross 50%.

In summary:

  1. Scenarios were successful in all cases where the CPU limit was 1 core, irrespective of whether the request was equal to or less than the limit (scenarios 2 & 3).

  2. Although scenarios 3 & 4 are structurally identical (CPU request equal to the limit), the fault only works when the values are in cores (scenario 3) and does not work when the values are in millicores (scenario 4).

  3. Although scenarios 1 & 2 are structurally identical (CPU request less than the limit), the fault only works when the limit is in cores (scenario 2) and does not work when the limit is in millicores (scenario 1).

Please note that whatever fixes or enhancements are required for the CPU fault, if any, will most likely apply to the MEMORY fault as well.

DEV03-DC.txt
DEV70-DC.txt

@Anvesh42

@rpraveen-vmware @ashrimalivmware Has there been any update on this? I hope your team was able to replicate the scenarios that we went over during our meeting a few weeks ago, as also described in detail above.

@rpraveen-vmware
Copy link
Contributor

rpraveen-vmware commented Apr 20, 2022

Hi @Anvesh42, @ashrimalivmware
We tried to simulate the above scenarios in our K8s environment. Similar to scenario 1, we deployed a pod with the CPU/memory configuration below and tried an application-level CPU spike fault of 80% on the pod:

  Limits:
    cpu: 1200m
    memory: 4000Mi
  Requests:
    cpu: 900m
    memory: 3800Mi

We did see the CPU spike cross 80 percent while checking through kubectl top pod.
However, we see that you have the pods deployed on OpenShift containers.
We would need to troubleshoot whether this behaviour is environment specific.
