
1.8.0

@ksatchit released this 15 Sep 17:18 · b8b4ade

New Features & Enhancements

  • Introduces the alpha-0 version of the Litmus Portal. The portal helps you execute & visualize chaos workflows, amongst many other things.

  • Extends Litmus Probes with a “Continuous” mode to validate the hypothesis around application behavior throughout chaos execution, rather than only at specific points/phases (start & end of chaos).
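
    For illustration, a minimal probe definition with the new mode might look as follows (a sketch; field names such as httpProbe/inputs and expectedResponseCode follow the probe docs of this release line and should be verified against your version):

    probe:
    - name: "check-frontend-availability"
      type: "httpProbe"
      httpProbe/inputs:
        url: "http://frontend.app.svc.cluster.local:8080/healthz"
        expectedResponseCode: "200"
      mode: "Continuous"      # validated repeatedly throughout the chaos duration
      runProperties:
        probeTimeout: 5       # seconds to wait for each probe attempt
        interval: 2           # seconds between successive probe runs
        retry: 1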

  • Adds node- & pod-level I/O stress chaos experiments to the generic experiment suite, with the ability to tune worker threads and filesystem usage.
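
    For example, the pod-level variant can be tuned via the ChaosEngine along these lines (a sketch; the env names are taken from the experiment docs and the values are illustrative):

    experiments:
    - name: pod-io-stress
      spec:
        components:
          env:
            - name: FILESYSTEM_UTILIZATION_PERCENTAGE
              value: "10"       # fill 10% of the available storage
            - name: NUMBER_OF_WORKERS
              value: "4"        # concurrent stress worker threads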

  • Supports network chaos on Containerd & CRI-O runtimes, in addition to Docker.

  • Supports network chaos between distinct microservices (in addition to total interface-level egress traffic chaos), specified by their IPs or hostnames/service FQDNs.
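
    A sketch of the relevant ChaosEngine env tunables for the two bullets above (the env names, particularly TARGET_HOSTS, and the socket path are indicative; confirm them against the experiment docs for your runtime):

    components:
      env:
        - name: CONTAINER_RUNTIME
          value: "containerd"                        # or "crio" / "docker"
        - name: SOCKET_PATH
          value: "/run/containerd/containerd.sock"   # runtime socket path on the node
        - name: TARGET_HOSTS
          value: "cart.app.svc.cluster.local"        # restrict chaos to traffic towards this service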

  • Enhances the ChaosSchedule schema for repeat mode by adding IncludedHours & IncludedDays. The StartTime/EndTime definitions are now optional, allowing a schedule to run from the point of creation of the schedule CR, or indefinitely until removal.
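
    A rough sketch of a repeat-mode schedule using the new fields (the exact nesting of these properties is an assumption; consult the chaos-scheduler docs for the authoritative schema):

    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosSchedule
    metadata:
      name: schedule-nginx
    spec:
      schedule:
        repeat:
          minChaosInterval: "2m"        # gap between successive chaos runs
          includedDays: "Mon,Wed,Fri"
          includedHours: "0-8"          # StartTime/EndTime omitted: runs from creation until removal
      engineTemplateSpec:
        # ...embedded ChaosEngine spec...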

  • Migrates the Cassandra ring disruption experiment to the Go-based chaoslib.

  • Adds the ability to specify a target pod (env: TARGET_POD) or node (env: APP_NODE) as the application/resource under test, in addition to the default randomized selection based on labels.

  • Enables the definition of an application's blast radius as a percentage value (PODS_AFFECTED_PERCENTAGE), so that the corresponding number of replicas undergoes the specified chaos in parallel.
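
    Together, the two bullets above translate into env overrides along these lines (a sketch; the values are illustrative, and pinning a specific pod vs. targeting a percentage would normally be used separately):

    components:
      env:
        - name: TARGET_POD
          value: "nginx-7bb8c8-x2x4v"   # pin the chaos to one specific pod, or
        - name: PODS_AFFECTED_PERCENTAGE
          value: "50"                   # subject 50% of matching replicas to chaos in parallel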

  • Improves the litmus chaoslib to take the container filesystem & runtime socket file paths as tunables, in order to support different Kubernetes platforms.

  • Includes an additional pumba-based chaoslib for CPU/memory stress that uses external chaos containers (non-pod-exec mode).

  • Adds chaos command tunables (for chaos injection & revert) to the CPU/memory chaoslib (in pod-exec mode), in order to cover different base images & distros.
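
    For instance, the injection command for the pod-exec CPU hog lib could be overridden as follows (CHAOS_INJECT_COMMAND is an assumed env name, mirroring the CHAOS_KILL_COMMAND tunable referenced under Known Issues; verify it against the experiment CR):

    components:
      env:
        - name: CHAOS_INJECT_COMMAND
          value: "md5sum /dev/zero"   # CPU-burn utility expected to exist in the target's base image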

  • Supports broader filtering of pods within a namespace when no application labels are provided in .spec.appinfo. Users can also skip specifying the application namespace explicitly, in which case the target pods are selected randomly from the ChaosEngine resource's namespace.

  • Modifies the litmus chaos containers (operator, runner) to run with non-root users

  • Allows the definition of an INSTANCE_ID in the ChaosEngine to provide additional context/metadata for an experiment run. This also aids the creation of new ChaosResult resources, instead of patching/overwriting existing ones, in case of repeated executions.
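
    A sketch of how this might be set (its placement under the experiment's env list is an assumption; the value is illustrative):

    components:
      env:
        - name: INSTANCE_ID
          value: "nightly-run-42"   # distinguishes repeated runs, yielding distinct ChaosResult names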

  • Improves the experiment code standards by fixing the issues listed in the Go Report Card for the litmus-go repository.

  • Generates events against the ChaosResult resource to indicate the experiment verdict (Pass, Fail, Stopped). These are useful in annotating monitoring dashboards with experiment results.
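
    These events can be inspected with standard tooling, e.g.:

    kubectl describe chaosresult <engine-name>-<experiment-name> -n <namespace>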

  • Enhances the Chaos Exporter to push chaos metrics to AWS CloudWatch

  • Improves the kubernetes-chaos helm chart by including options in the values.yaml to selectively install experiments via a whitelist/blacklist. Also maps the experiment names to reflect those on the ChaosHub.

  • Enhances the litmus-e2e suite with increased reporting around component tests, the addition of e2e tests for the new experiments, and a Docker-based GitLab runner for the litmus-portal pipelines.

  • Provides additional documentation covering the experiment enhancements. Updates the getting-started documentation for general Kubernetes/OpenShift/Rancher platforms.

  • Enhances the litmus-demo scripts to generate a PDF report for the chaos experiments executed.

  • Operationalizes the Litmus community Special Interest Groups (SIGs) for Documentation, Observability & Integrations.

Major Bug Fixes

  • Constructs the ChaosResult name using the experiment name passed from the ChaosExperiment resource, instead of hardcoded experiment names.

  • Fixes the chaos verification (whether chaos injection has occurred) steps in the container-kill experiment & retains the helper containers in case of errors for further debugging

  • Fixes the chaos event messages to be meaningful, and includes probe information only when probes are actually defined.

  • Removes the need for privileged containers to execute the disk-fill chaos experiment.

  • Handles the case where the CPU/memory hog chaos processes are terminated or the target containers are OOM-killed (this typically occurs when the memory hog/injection value exceeds the resource limits set on the pods/containers). The exit code 137 is handled appropriately with warning logs, and the experiment proceeds with its verification steps instead of erroring out/failing, since the OOM-kill is expected behavior given the inputs provided.

  • Fixes the node-memory-hog experiment to measure the provided input (percentage of node memory) against the available memory instead of the total system memory.

  • Propagates the custom chaos experiment annotations provided in the ChaosExperiment to the helper pods, if any. This is especially useful in cases where annotations determine scheduling or map to specific IAM roles/accounts.

Deprecations & Breaking Changes

  • The instance count (.spec.schedule.instanceCount) property on the ChaosSchedule has been deprecated, in favor of maintaining just the minChaosInterval as the means of defining chaos cadence.

Major Known Issues & Limitations

Issue

  • The network chaos experiments (especially on the Docker runtime, using the litmus pumba lib) can end up with a Failed ChaosResult, with the app stuck in a CrashLoopBackOff state, in case of application deployments configured with liveness probes that access health/service endpoints. Typically, this lib injects the tc netem rule against the interface by running a “chaos container” that attaches to the network namespace of the target container via the target’s container ID. The same ID is used by a subsequent container launched to revert the rule/chaos. However, with liveness probes, the container is restarted several times over the course of the chaos duration, causing the ID to change. The revert then fails, with the network rule still persisting (courtesy of the Kubernetes pause container for this app pod), leading the app to enter a CrashLoopBackOff state.

Current Workaround

  • Delete/reschedule the target pod manually to recreate the pause container/network namespace.
  • Use target IPs or hosts to inject the chaos between specific microservices, while keeping the liveness probe endpoint reachable.

Note: This is expected to be fixed in a 1.8.x patch release

Issue

  • The kubelet-service-kill experiment makes use of systemctl to stop/start the service today. Running this experiment without an external LIB_IMAGE, i.e., using the experiment image itself, can throw the error Failed to connect to bus: No data available, as the experiment runs with a non-root user.

Current Workaround

  • A standard Ubuntu image that runs as root can be used in a “helper” pod that injects this chaos. However, user discretion is advised in terms of providing this access.
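
    For example (the image value is illustrative; any root-capable image with access to systemctl would do):

    components:
      env:
        - name: LIB_IMAGE
          value: "ubuntu:16.04"   # root-capable helper image used to run systemctl on the node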

Issue

  • The pod-cpu-hog & pod-memory-hog experiments that run in pod-exec mode (typically used when users don’t want to mount the runtime’s socket file on their pods) with the default lib can fail despite chaos being injected successfully, due to the unavailability in the target’s image of certain default utilities that are used to detect the chaos processes and kill them (i.e., revert chaos) at the end of the chaos duration.

Workaround

  • Users can identify the necessary commands to derive and kill the chaos PIDs, and pass them to the experiment via the CHAOS_KILL_COMMAND env variable.
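
    For example, for a CPU hog injected via md5sum, a kill command along these lines could be supplied (illustrative only; adapt it to the utilities actually present in the target image):

    components:
      env:
        - name: CHAOS_KILL_COMMAND
          value: 'kill -9 $(ps afx | grep "[md5sum] /dev/zero" | awk ''{print $1}'' | tr "\n" " ")'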

  • Alternatively, they can make use of the chaos lib that runs external containers with the SYS_ADMIN docker capability to inject/revert the chaos, while mounting the runtime socket file. Note that this is supported only on Docker at this point.

Note: This is expected to be fixed in a 1.8.x patch release

Installation

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.8.0.yaml

Verify your installation

  • Verify if the chaos operator is running
    kubectl get pods -n litmus

  • Verify if chaos CRDs are installed
    kubectl get crds | grep chaos

For more details, refer to the documentation at Docs.