feat(metadata:diskless): add controller metrics for diskless topics #503

Merged
viktorsomogyi merged 2 commits into main from jeqo/pod-2001-controller-topic-metrics
Mar 11, 2026
Conversation

@jeqo jeqo commented Feb 9, 2026

Adds controller metrics to track diskless topics, building on #492 (managed replicas).

Changes

  • Add DisklessTopicCount, DisklessPartitionCount, and DisklessOfflinePartitionCount gauges to the controller
  • Exclude diskless partitions from OfflinePartitionsCount and PreferredReplicaImbalanceCount (the transformer handles their availability)
  • Include diskless partitions in GlobalPartitionCount, so the classic partition count can be derived as GlobalPartitionCount - DisklessPartitionCount
  • Resolve diskless status directly from MetadataImage configs in both snapshot and delta paths
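
As a rough illustration of the last two bullets, the sketch below resolves a topic's diskless flag from its config properties and derives the classic partition count from the two gauges. It is a standalone model rather than the controller code: the class name, both method names, and the literal key `diskless.enable` are assumptions (the PR only refers to `TopicConfig.DISKLESS_ENABLE_CONFIG`).

```java
import java.util.Properties;

final class DisklessCountsSketch {
    // Assumed key name; the PR only references TopicConfig.DISKLESS_ENABLE_CONFIG.
    static final String DISKLESS_ENABLE = "diskless.enable";

    /** A topic counts as diskless only when its config explicitly enables it. */
    static boolean isDiskless(Properties topicConfig) {
        return Boolean.parseBoolean(topicConfig.getProperty(DISKLESS_ENABLE, "false"));
    }

    /** Classic partitions = GlobalPartitionCount - DisklessPartitionCount. */
    static int classicPartitionCount(int globalPartitionCount, int disklessPartitionCount) {
        if (disklessPartitionCount < 0 || disklessPartitionCount > globalPartitionCount) {
            throw new IllegalArgumentException("diskless count must be within [0, global]");
        }
        return globalPartitionCount - disklessPartitionCount;
    }
}
```

A missing config defaults to classic here, which mirrors the `"false"` default used in the excerpt quoted later in this conversation.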

New Metrics

| Metric | Description |
|---|---|
| DisklessTopicCount | Total number of diskless topics |
| DisklessPartitionCount | Total number of diskless partitions |
| DisklessOfflinePartitionCount | Diskless partitions with no leader in KRaft metadata |

Test plan

  • Unit tests for all new metric gauges (set/add/get)
  • Snapshot path: verify counts for diskless and non-diskless images
  • Delta path: verify diskless partitions excluded from offline/imbalance
  • Metric names registered and cleaned up on close
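
To make the set/add/get item concrete, here is a minimal sketch of a gauge backed by an `AtomicInteger`. The class and method names are hypothetical and only loosely follow the set/add/get wording above; the real gauges are registered with the controller's metrics registry, which this sketch leaves out.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class DisklessGaugeSketch {
    // Backing value for a hypothetical DisklessPartitionCount gauge.
    private final AtomicInteger disklessPartitionCount = new AtomicInteger(0);

    /** Snapshot path: overwrite the value with a freshly computed total. */
    void setDisklessPartitionCount(int value) {
        disklessPartitionCount.set(value);
    }

    /** Delta path: apply a signed change on top of the current value. */
    void addToDisklessPartitionCount(int delta) {
        disklessPartitionCount.addAndGet(delta);
    }

    /** Value the gauge would report. */
    int disklessPartitionCount() {
        return disklessPartitionCount.get();
    }
}
```

A unit test in this style would set a value, apply positive and negative deltas, and assert the reported value after each step.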

@jeqo jeqo force-pushed the jeqo/pod-2001-controller-topic-metrics branch from 6ddf9d8 to 742bb4f Compare February 9, 2026 14:48
@jeqo jeqo force-pushed the jeqo/pod-2001-diskless-managed-replica branch 3 times, most recently from 6fba0c9 to 1f3c7e1 Compare February 9, 2026 19:06
@jeqo jeqo force-pushed the jeqo/pod-2001-controller-topic-metrics branch from 742bb4f to 4de2bb4 Compare February 9, 2026 19:36
@jeqo jeqo force-pushed the jeqo/pod-2001-diskless-managed-replica branch 3 times, most recently from 82dbca5 to 5f41dde Compare February 10, 2026 00:46
@jeqo jeqo force-pushed the jeqo/pod-2001-diskless-managed-replica branch 2 times, most recently from 18e49b9 to 20c315d Compare March 4, 2026 11:15
@jeqo jeqo force-pushed the jeqo/pod-2001-controller-topic-metrics branch from 4de2bb4 to 919d6a7 Compare March 4, 2026 13:54
@jeqo jeqo force-pushed the jeqo/pod-2001-diskless-managed-replica branch from 20c315d to 9a44b38 Compare March 4, 2026 14:37
Base automatically changed from jeqo/pod-2001-diskless-managed-replica to main March 10, 2026 10:52
@jeqo jeqo force-pushed the jeqo/pod-2001-controller-topic-metrics branch from 919d6a7 to 9be9a4c Compare March 10, 2026 10:57
@jeqo jeqo requested a review from Copilot March 10, 2026 10:58

@jeqo jeqo force-pushed the jeqo/pod-2001-controller-topic-metrics branch from 9be9a4c to bc28ceb Compare March 10, 2026 13:17
@jeqo jeqo requested a review from Copilot March 10, 2026 14:24
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

metadata/src/main/java/org/apache/kafka/controller/metrics/ControllerMetadataMetricsPublisher.java:131

  • publishDelta now computes diskless-ness from newImage/prevImage configs and updates the new diskless metrics on LOG_DELTA updates, but there are no unit tests covering the LOG_DELTA path for diskless topics (create/delete/partition changes). Adding a test that drives onMetadataUpdate(..., fakeManifest(false)) and asserts diskless counts change correctly would help prevent regressions in the real-time update logic.
    private void publishDelta(MetadataDelta delta, MetadataImage newImage) {
        // Use newImage configs to check if topic is diskless, as the metadata cache
        // may not have the config yet when processing deltas for newly created topics
        Function<String, Boolean> isDisklessFromImage = topicName -> {
            ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, topicName);
            Properties props = newImage.configs().configProperties(resource);
            return Boolean.parseBoolean(props.getProperty(TopicConfig.DISKLESS_ENABLE_CONFIG, "false"));
        };
        ControllerMetricsChanges changes = new ControllerMetricsChanges(isDisklessFromImage);
        if (delta.clusterDelta() != null) {
            for (Entry<Integer, Optional<BrokerRegistration>> entry :
                    delta.clusterDelta().changedBrokers().entrySet()) {
                changes.handleBrokerChange(
                    prevImage.cluster().brokers().get(entry.getKey()),
                    entry.getValue().orElse(null),
                    metrics
                );
            }
        }
        if (delta.topicsDelta() != null) {
            for (Uuid topicId : delta.topicsDelta().deletedTopicIds()) {
                TopicImage prevTopic = prevImage.topics().topicsById().get(topicId);
                if (prevTopic == null) {
                    throw new RuntimeException("Unable to find deleted topic id " + topicId +
                            " in previous topics image.");
                }
                // For deleted topics, check isDiskless from prevImage since config is already removed from newImage
                ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, prevTopic.name());
                Properties props = prevImage.configs().configProperties(resource);
                boolean isDiskless = Boolean.parseBoolean(props.getProperty(TopicConfig.DISKLESS_ENABLE_CONFIG, "false"));
                changes.handleDeletedTopic(prevTopic, isDiskless);
            }
            for (Entry<Uuid, TopicDelta> entry : delta.topicsDelta().changedTopics().entrySet()) {
                changes.handleTopicChange(prevImage.topics().getTopic(entry.getKey()), entry.getValue());
            }
        }
        changes.apply(metrics);
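
The accounting that such a LOG_DELTA test would assert can be modeled in isolation. The sketch below is not the Kafka classes: `Image`, `applyDelta`, and the counter fields are hypothetical stand-ins for `MetadataImage`, `publishDelta`, and the controller gauges. It follows the same rule as the excerpt above: deleted topics are classified from the previous image, created or changed topics from the new one. It assumes a topic's diskless flag does not change between images.

```java
import java.util.Map;

final class DisklessDeltaSketch {
    /** Hypothetical stand-in for the topic and config parts of a MetadataImage. */
    static final class Image {
        final Map<String, Integer> partitionsByTopic;
        final Map<String, Boolean> disklessByTopic;

        Image(Map<String, Integer> partitionsByTopic, Map<String, Boolean> disklessByTopic) {
            this.partitionsByTopic = partitionsByTopic;
            this.disklessByTopic = disklessByTopic;
        }

        boolean isDiskless(String topic) {
            return disklessByTopic.getOrDefault(topic, false);
        }
    }

    int globalPartitionCount;
    int disklessTopicCount;
    int disklessPartitionCount;

    void applyDelta(Image prev, Image next) {
        // Deleted topics: their config is gone from next, so classify them from prev.
        for (Map.Entry<String, Integer> e : prev.partitionsByTopic.entrySet()) {
            String topic = e.getKey();
            if (next.partitionsByTopic.containsKey(topic)) {
                continue;
            }
            globalPartitionCount -= e.getValue();
            if (prev.isDiskless(topic)) {
                disklessTopicCount--;
                disklessPartitionCount -= e.getValue();
            }
        }
        // Created or resized topics: classify them from next.
        for (Map.Entry<String, Integer> e : next.partitionsByTopic.entrySet()) {
            String topic = e.getKey();
            boolean created = !prev.partitionsByTopic.containsKey(topic);
            int added = e.getValue() - prev.partitionsByTopic.getOrDefault(topic, 0);
            globalPartitionCount += added;
            if (next.isDiskless(topic)) {
                if (created) {
                    disklessTopicCount++;
                }
                disklessPartitionCount += added;
            }
        }
    }
}
```

A test against the real publisher would drive the same create, resize, and delete transitions through `onMetadataUpdate` and assert the corresponding gauge values.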

@jeqo jeqo force-pushed the jeqo/pod-2001-controller-topic-metrics branch 2 times, most recently from e355a6d to 3bb19bc Compare March 10, 2026 16:11
@jeqo jeqo requested a review from Copilot March 10, 2026 16:20
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

@jeqo jeqo force-pushed the jeqo/pod-2001-controller-topic-metrics branch from 3bb19bc to 0380c0b Compare March 10, 2026 16:55
@jeqo jeqo requested a review from Copilot March 10, 2026 16:56
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

Add topic- and partition-level diskless metrics to the controller:
DisklessTopicCount, DisklessPartitionCount, and
DisklessOfflinePartitionCount.

These gauges provide operational visibility into diskless
partition health. Diskless partitions are excluded from
OfflinePartitionsCount and PreferredReplicaImbalanceCount (the
transformer handles availability), but included in
GlobalPartitionCount. Diskless status is resolved directly
from MetadataImage configs in both snapshot and delta paths.
@jeqo jeqo force-pushed the jeqo/pod-2001-controller-topic-metrics branch from 0380c0b to 90fa55a Compare March 10, 2026 17:22
@jeqo jeqo marked this pull request as ready for review March 10, 2026 21:00
viktorsomogyi previously approved these changes Mar 11, 2026
@viktorsomogyi viktorsomogyi left a comment

Overall LGTM, but we will probably need a follow-up once the classic <-> diskless switch is complete, to handle that scenario too.

jeqo commented Mar 11, 2026

@viktorsomogyi I have added a fixup to handle the switch scenario; it seems worth having already, to support those config updates. PTAL
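
The fixup itself is not shown in this conversation. As one plausible reading of the switch scenario, the sketch below moves a topic's partitions between the classic and diskless gauges when its diskless flag flips. Every name in it is hypothetical, and the rule that leaderless partitions move between OfflinePartitionsCount and DisklessOfflinePartitionCount is an assumption inferred from the PR description, not confirmed behavior.

```java
final class DisklessSwitchSketch {
    int disklessTopicCount;
    int disklessPartitionCount;
    int disklessOfflinePartitionCount;
    int offlinePartitionsCount;

    /**
     * Re-home one topic after its diskless flag changed.
     * GlobalPartitionCount is unaffected because it counts both kinds.
     *
     * @param nowDiskless true for classic -> diskless, false for diskless -> classic
     * @param partitions  number of partitions in the topic
     * @param leaderless  how many of those partitions currently have no leader
     */
    void onDisklessFlagChanged(boolean nowDiskless, int partitions, int leaderless) {
        int sign = nowDiskless ? 1 : -1;
        disklessTopicCount += sign;
        disklessPartitionCount += sign * partitions;
        // Assumed: leaderless partitions are reported by exactly one of the two offline gauges.
        disklessOfflinePartitionCount += sign * leaderless;
        offlinePartitionsCount -= sign * leaderless;
    }
}
```

Switching a topic to diskless and back leaves all four counters where they started, which is a cheap invariant to assert in a test.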

@viktorsomogyi viktorsomogyi merged commit 879b72e into main Mar 11, 2026
5 checks passed
@viktorsomogyi viktorsomogyi deleted the jeqo/pod-2001-controller-topic-metrics branch March 11, 2026 12:53
AnatolyPopov pushed a commit that referenced this pull request Mar 23, 2026
…503)

(cherry picked from commit 879b72e)

# Conflicts:
#	core/src/main/scala/kafka/server/ControllerServer.scala
jeqo added a commit that referenced this pull request Mar 23, 2026
…503)
jeqo added a commit that referenced this pull request Mar 23, 2026
…503)

(cherry picked from commit 879b72e)

# Conflicts:
#	core/src/main/scala/kafka/server/ControllerServer.scala
3 participants