Skip to content

fix(ksm): set ksm-enabled label to true independently from node pressure #11306

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 28, 2024

Conversation

fossedihelm
Copy link
Contributor

@fossedihelm fossedihelm commented Feb 20, 2024

What this PR does

Before this PR:
Currently, KubeVirt adds the kubevirt.io/ksm-enabled label to those nodes that match the labelSelector defined in kvConfiguration. The value of that label will be true only if that node is under pressure, false otherwise.

After this PR:
This is wrong since we need to provide users the information that a particular node is configured to use ksm, so that it can be used to schedule vms there.

Fixes #
jira-ticket: https://issues.redhat.com/browse/CNV-38568

Why we need it and why it was done in this way

The following tradeoffs were made:

The following alternatives were considered:

Links to places where the discussion took place:

Special notes for your reviewer

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note

fix(ksm): set the `kubevirt.io/ksm-enabled` node label to true if the ksm is managed by KubeVirt, instead of reflect the actual ksm value.

Sorry, something went wrong.

@kubevirt-bot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/M labels Feb 20, 2024
@fossedihelm
Copy link
Contributor Author

/cc @jean-edouard

@fabiand
Copy link
Member

fabiand commented Feb 20, 2024

particular node is "protected" by ksm

protected might be misleading. Specifically because ksm has a security to it.
Maybe simply "particular node is configured to use ksm
?

@fossedihelm
Copy link
Contributor Author

fossedihelm commented Feb 20, 2024

particular node is "protected" by ksm

protected might be misleading. Specifically because ksm has a security to it. Maybe simply "particular node is configured to use ksm ?

Thanks @fabiand! Looks much better! Adjusted PR description and commit message as well!

@fossedihelm fossedihelm marked this pull request as ready for review February 21, 2024 08:34
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 21, 2024
@Barakmor1
Copy link
Member

/cc

@@ -249,32 +249,33 @@ func getIntParam(node *v1.Node, param string, defaultValue, lowerBound, upperBou
// will set the outcome value to the n.KSM struct
// If the node labels match the selector terms, the ksm will be enabled.
// Empty Selector will enable ksm for every node
func handleKSM(node *v1.Node, clusterConfig *virtconfig.ClusterConfig) (bool, bool) {
available, running := loadKSM()
func handleKSM(node *v1.Node, clusterConfig *virtconfig.ClusterConfig) (ksmHandlerEnabled, ksmEnabledByUs bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that a better name would be shouldSetKsmAsEnabled i was confused by the these names

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this part is not trivial at all, but IMHO this function will not just enable KSM, but disable it too.
I will leave the rename discussion for a separate/refactoring PR. WDYT? Thanks

Copy link
Member

@Barakmor1 Barakmor1 Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was speaking about the return value ksmHandlerEnabled personally the name really confused me maybe we can fnid a better name than shouldSetKsmAsEnabled But IIUC this var only point is to set the value of this label

so maybe something like ksmLabelValue would make more sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got you! I like the ksmLabelValue.

@Barakmor1
Copy link
Member

Barakmor1 commented Feb 22, 2024

Don't we need to remove or change line 151 in ksm.go to

			if ksm.pages < nPagesMin {
				ksm.pages = nPagesMin
				ksm.running = true
			}

or maybe change line 187 in writeKsmValuesToFiles to:

	err := os.WriteFile(ksmRunPath, []byte([]byte("1\n")), 0644)

To make the changes in this PR description?

@fossedihelm
Copy link
Contributor Author

@Barakmor1 Thanks for the review! I pushed almost all the changes with the suggested code. PTAL
For the other things, I will reply inline. Thank you!

@fossedihelm
Copy link
Contributor Author

Don't we need to remove or change line 151 in ksm.go to

			if ksm.pages < nPagesMin {
				ksm.pages = nPagesMin
				ksm.running = true
			}

or maybe change line 187 in writeKsmValuesToFiles to:

	err := os.WriteFile(ksmRunPath, []byte([]byte("1\n")), 0644)

To make the changes in this PR description?

This PR is just to change the node labelling behavior, not the ksm activation logic.
This piece of code is regarding the activation of ksm, that will occur only on node pressure.

@Barakmor1
Copy link
Member

Barakmor1 commented Feb 22, 2024

Don't we need to remove or change line 151 in ksm.go to

			if ksm.pages < nPagesMin {
				ksm.pages = nPagesMin
				ksm.running = true
			}

or maybe change line 187 in writeKsmValuesToFiles to:

	err := os.WriteFile(ksmRunPath, []byte([]byte("1\n")), 0644)

To make the changes in this PR description?

This PR is just to change the node labelling behavior, not the ksm activation logic. This piece of code is regarding the activation of ksm, that will occur only on node pressure.

If this is the case why do we need this PR?
as i see it it make sense to set kubevirt.io/ksm-enabled: false if ksm is disabled don't you think?

Maybe we should change the name of this label i find it super confusing...

Maybe something like kubevirt.io/ksm-handling: true

@fossedihelm
Copy link
Contributor Author

fossedihelm commented Feb 22, 2024

Don't we need to remove or change line 151 in ksm.go to

			if ksm.pages < nPagesMin {
				ksm.pages = nPagesMin
				ksm.running = true
			}

or maybe change line 187 in writeKsmValuesToFiles to:

	err := os.WriteFile(ksmRunPath, []byte([]byte("1\n")), 0644)

To make the changes in this PR description?

This PR is just to change the node labelling behavior, not the ksm activation logic. This piece of code is regarding the activation of ksm, that will occur only on node pressure.

If this is the case why do we need this PR? as i see it it make sense to set kubevirt.io/ksm-enabled: false if ksm is disabled don't you think?

Because we need to provide users the information that a particular node is configured to use ksm, so that it can be used to schedule vms there.
Basically, we need to label those nodes where kubevirt, in case of node pressure, will activate ksm to reclaim some memory.
Having the kubevirt.io/ksm-enabled to true only when ksm is actually enabled is how it works right now; but this information is a mis-information, because it let people think that that node is a "safer" place to schedule VMs, but actually is the opposite, since it is currently under pressure. Do you agree? Thank you!

@Barakmor1
Copy link
Member

Barakmor1 commented Feb 22, 2024

Don't we need to remove or change line 151 in ksm.go to

			if ksm.pages < nPagesMin {
				ksm.pages = nPagesMin
				ksm.running = true
			}

or maybe change line 187 in writeKsmValuesToFiles to:

	err := os.WriteFile(ksmRunPath, []byte([]byte("1\n")), 0644)

To make the changes in this PR description?

This PR is just to change the node labelling behavior, not the ksm activation logic. This piece of code is regarding the activation of ksm, that will occur only on node pressure.

If this is the case why do we need this PR? as i see it it make sense to set kubevirt.io/ksm-enabled: false if ksm is disabled don't you think?

Because we need to provide users the information that a particular node is configured to use ksm, so that it can be used to schedule vms there. Basically, we need to label those nodes where kubevirt, in case of node pressure, will activate ksm to reclaim some memory. Having the kubevirt.io/ksm-enabled to true only when ksm is actually enabled is how it works right now; but this information is a mis-information, because it let people think that that node is a "safer" place to schedule VMs, but actually is the opposite, since it is currently under pressure. Do you agree? Thank you?

I think that the name of the label is very confusing and we should avoid setting label called kubevirt.io/ksm-enabled to false if ksm is enabled that doesn't make sense IMO

Comment on lines 259 to 277
if ksmConfig == nil && enabled {
disableKSM(node)
return false, false
}

selector, err := metav1.LabelSelectorAsSelector(ksmConfig.NodeLabelSelector)
if err != nil {
log.DefaultLogger().Errorf("An error occurred while converting the ksm selector: %s", err)
return running, false
return false, false
}

if !selector.Matches(labels.Set(node.ObjectMeta.Labels)) {
if disableKSM(node, running) {
return false, false
} else {
return running, false
}
if !selector.Matches(labels.Set(node.ObjectMeta.Labels)) && enabled {
disableKSM(node)
return false, false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need the && enabled ?
I think that we should return from this func even if it is not enabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, but we also don't want to write always the same value, when ksm is disabled.
Will adjust! Thank you

@jean-edouard
Copy link
Contributor

@Barakmor1 Yes, that name right now is misleading @acardace @xpivarc @jean-edouard What do you think?

  1. Adding a new label kubevirt.io/ksm-handling alongside the already existing one?
  2. Rename the kubevirt.io/ksm-enbled to kubevirt.io/ksm-handling?
    Thank you!

Neither IMO. We need to restore KSM behavior to what's documented in user-guide, which this PR does, thank you! :)
As far as conveying the information of "is KSM actually running or not?" (i.e. KSM is enabled + there's memory pressure), we could add a label for that if we think it's useful, like ksm-running or ksm-active. I guess it would make sense for some workloads to have an anti-affinity with such a label.
Thanks again for fixing the logic!

@Barakmor1
Copy link
Member

@Barakmor1 Yes, that name right now is misleading @acardace @xpivarc @jean-edouard What do you think?

  1. Adding a new label kubevirt.io/ksm-handling alongside the already existing one?
  2. Rename the kubevirt.io/ksm-enbled to kubevirt.io/ksm-handling?
    Thank you!

Neither IMO. We need to restore KSM behavior to what's documented in user-guide, which this PR does, thank you! :) As far as conveying the information of "is KSM actually running or not?" (i.e. KSM is enabled + there's memory pressure), we could add a label for that if we think it's useful, like ksm-running or ksm-active. I guess it would make sense for some workloads to have an anti-affinity with such a label. Thanks again for fixing the logic!

Sounds good to me, maybe we can think of something in a follow up PR

@Barakmor1
Copy link
Member

Barakmor1 commented Feb 22, 2024

/lgtm

@fossedihelm Thank you
Great work

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 22, 2024
@fossedihelm
Copy link
Contributor Author

Thank you both @jean-edouard @Barakmor1

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 24, 2024

Verified

This commit was signed with the committer’s verified signature.
fossedihelm Federico Fossemò
Currently, KubeVirt adds the `kubevirt.io/ksm-enabled` label to those nodes
that match the labelSelector defined in kvConfiguration. The value of that
label will be true only if that node is under pressure, false otherwise.

This is wrong since we need to provide users the information that a
particular node is configured to use ksm, so that it can
be used to schedule vms there.

Signed-off-by: fossedihelm <ffossemo@redhat.com>
@kubevirt-bot kubevirt-bot removed lgtm Indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 26, 2024
@fossedihelm
Copy link
Contributor Author

@Barakmor1 Changes rebase

@Barakmor1
Copy link
Member

/lgtm

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2024
Copy link
Contributor

@jean-edouard jean-edouard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@kubevirt-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jean-edouard

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 26, 2024
@kubevirt-commenter-bot
Copy link

Required labels detected, running phase 2 presubmits:
/test pull-kubevirt-e2e-windows2016
/test pull-kubevirt-e2e-kind-1.27-vgpu
/test pull-kubevirt-e2e-kind-sriov
/test pull-kubevirt-e2e-k8s-1.29-ipv6-sig-network
/test pull-kubevirt-e2e-k8s-1.27-sig-network
/test pull-kubevirt-e2e-k8s-1.27-sig-storage
/test pull-kubevirt-e2e-k8s-1.27-sig-compute
/test pull-kubevirt-e2e-k8s-1.27-sig-operator
/test pull-kubevirt-e2e-k8s-1.28-sig-network
/test pull-kubevirt-e2e-k8s-1.28-sig-storage
/test pull-kubevirt-e2e-k8s-1.28-sig-compute
/test pull-kubevirt-e2e-k8s-1.28-sig-operator

@kubevirt-commenter-bot
Copy link

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@kubevirt-commenter-bot
Copy link

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fossedihelm
Copy link
Contributor Author

/cherrypick release-1.2

@kubevirt-bot
Copy link
Contributor

@fossedihelm: once the present PR merges, I will cherry-pick it on top of release-1.2 in a new PR and assign it to you.

In response to this:

/cherrypick release-1.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kubevirt-commenter-bot
Copy link

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fossedihelm
Copy link
Contributor Author

/retest

@kubevirt-bot kubevirt-bot merged commit 3499434 into kubevirt:main Feb 28, 2024
@kubevirt-bot
Copy link
Contributor

@fossedihelm: #11306 failed to apply on top of branch "release-1.2":

Applying: fix(ksm): set ksm-enabled label to true independently from node pressure
Using index info to reconstruct a base tree...
M	pkg/virt-handler/heartbeat/ksm.go
M	staging/src/kubevirt.io/api/core/v1/types.go
Falling back to patching base and 3-way merge...
Auto-merging staging/src/kubevirt.io/api/core/v1/types.go
Auto-merging pkg/virt-handler/heartbeat/ksm.go
CONFLICT (content): Merge conflict in pkg/virt-handler/heartbeat/ksm.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 fix(ksm): set ksm-enabled label to true independently from node pressure
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-1.2

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants