
[DRAFT] feat: client side metrics handlers #16760

Draft
daniel-sanche wants to merge 6 commits into bigtable_csm_2_instrumentation_advanced from bigtable_csm_3_handlers

Conversation

@daniel-sanche
Contributor

@daniel-sanche daniel-sanche commented Apr 22, 2026

Migrate googleapis/python-bigtable#1189 to the monorepo

This PR builds off of googleapis/python-bigtable#1187 to add handlers to the client-side metrics system. Handlers subscribe to the metrics stream and export the results to different collection systems.

We add two handlers to the system:

  • GoogleCloudMetricsHandler: sends metrics to a private OpenTelemetry meter, and then periodically exports them to GCP. Built on top of OpenTelemetryMetricsHandler
  • OpenTelemetryMetricsHandler: sends metrics to the root MeterProvider, so the user can access the exported metrics for their own systems. This will be off by default, but can be added alongside GoogleCloudMetricsHandler if needed
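The two handlers above share one subscription model: each receives completed-operation events and decides where to send them. A minimal sketch of that shape is below; all names here (`MetricsHandler`, `OperationMetric`, `on_operation_complete`) are illustrative placeholders, not the actual API introduced by this PR.

```python
# Hypothetical sketch of the handler pattern described above.
# Names are placeholders, not the PR's real classes.
from dataclasses import dataclass


@dataclass
class OperationMetric:
    """One completed operation from the metrics stream."""
    op_name: str
    latency_ms: float


class MetricsHandler:
    """Base interface: subclasses export metrics to a backend."""

    def on_operation_complete(self, metric: OperationMetric) -> None:
        raise NotImplementedError


class RecordingHandler(MetricsHandler):
    """Stand-in for an exporter-backed handler (e.g. OpenTelemetry)."""

    def __init__(self):
        self.seen = []

    def on_operation_complete(self, metric: OperationMetric) -> None:
        self.seen.append(metric)


# The instrumented client would fan each event out to all registered handlers.
handlers = [RecordingHandler()]
for h in handlers:
    h.on_operation_complete(OperationMetric("ReadRows", 12.5))
```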

TODO:

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements client-side metrics for the Bigtable library using OpenTelemetry, including a custom exporter for Google Cloud Monitoring. The review feedback focuses on improving resource efficiency by moving the MeterProvider and client_uid generation to the client level to avoid thread leaks and inconsistent identifiers across tables. Additionally, recommendations were made to handle potential KeyError exceptions in the exporter, improve logging for background export failures, and ensure non-negative timeouts during batch writes.

Comment on lines +108 to +114
gcp_reader = PeriodicExportingMetricReader(
exporter, export_interval_millis=export_interval * 1000
)
# use private meter provider to store instruments and views
self.meter_provider = MeterProvider(
metric_readers=[gcp_reader], views=VIEW_LIST
)
Contributor

high

Creating a new PeriodicExportingMetricReader and MeterProvider for every GoogleCloudMetricsHandler is highly inefficient. Since a handler is created for every Table instance, and each PeriodicExportingMetricReader starts its own background thread, this will lead to a significant thread leak and excessive resource consumption (e.g., if a user accesses hundreds of tables).

The MeterProvider and its associated reader should be initialized once at the BigtableDataClient level and shared across all table handlers. This also ensures that metrics are properly flushed and threads are shut down when the client is closed.
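The fix suggested above can be sketched as caching the provider on the client and handing the same instance to every per-table handler. This is a simplified stand-in (a fake provider class instead of the real OpenTelemetry `MeterProvider`, and assumed method names) to show the ownership change, not the PR's actual implementation:

```python
# Sketch: one provider (and thus one background reader thread) per client,
# shared by all per-table handlers. FakeMeterProvider stands in for
# opentelemetry.sdk.metrics.MeterProvider; class/method names are assumed.
class FakeMeterProvider:
    """Stand-in for MeterProvider; counts instances to show sharing."""
    instances = 0

    def __init__(self):
        FakeMeterProvider.instances += 1


class GoogleCloudMetricsHandler:
    def __init__(self, meter_provider):
        # handler no longer constructs its own provider/reader
        self.meter_provider = meter_provider


class BigtableDataClient:
    def __init__(self):
        # created once; shut down when the client closes
        self._meter_provider = FakeMeterProvider()

    def get_table_handler(self, table_name):
        # every table's handler reuses the client-level provider
        return GoogleCloudMetricsHandler(self._meter_provider)


client = BigtableDataClient()
h1 = client.get_table_handler("table-a")
h2 = client.get_table_handler("table-b")
assert h1.meter_provider is h2.meter_provider  # shared, no thread leak
```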

Comment on lines +164 to +171
"instance": data_point.attributes[
"resource_instance"
],
"cluster": data_point.attributes[
"resource_cluster"
],
"table": data_point.attributes["resource_table"],
"zone": data_point.attributes["resource_zone"],
Contributor

high

Accessing attributes directly via keys will raise a KeyError if any of the expected resource labels are missing from the data point. This is a known issue mentioned in the PR description. Using .get() or validating the presence of these keys is necessary for a robust exporter.

                                    "instance": data_point.attributes.get("resource_instance", ""),
                                    "cluster": data_point.attributes.get("resource_cluster", ""),
                                    "table": data_point.attributes.get("resource_table", ""),
                                    "zone": data_point.attributes.get("resource_zone", ""),

Comment on lines +156 to +159
for data_point in [
pt for pt in metric.data.data_points if pt.attributes
]:
if data_point.attributes:
Contributor

medium

The list comprehension and redundant if check can be simplified to improve readability and avoid unnecessary memory allocation.

                    for data_point in metric.data.data_points:
                        if data_point.attributes:

Comment on lines +191 to +195
try:
self._batch_write(all_series, deadline)
return MetricExportResult.SUCCESS
except Exception:
return MetricExportResult.FAILURE
Contributor

medium

Catching all exceptions and returning FAILURE without logging makes it very difficult to diagnose issues when metrics fail to export. Since this code runs in a background thread, these errors will be silent. Following repository guidelines for background tasks and broad exception blocks, log the exception as a warning (e.g., using logger.warning) to aid in debugging without creating excessive noise from transient failures.

References
  1. For exceptions in background tasks that are designed to be retried, log them as warnings rather than errors to reduce noise from transient, recoverable failures.
  2. Avoid broad except Exception: blocks that silently return None. Instead, log the exception (e.g., using logger.warning) to aid in debugging and prevent masking underlying issues.
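The suggested pattern might look like the sketch below: log the exception as a warning (with traceback via `exc_info`) before returning the failure result. String results stand in for `MetricExportResult`, and the `batch_write` callable is a placeholder for the exporter's `_batch_write` method:

```python
# Sketch of the suggested fix: surface background export failures via
# logger.warning instead of swallowing them silently. Result strings stand
# in for opentelemetry's MetricExportResult enum.
import logging

logger = logging.getLogger(__name__)


def export(all_series, deadline, batch_write):
    try:
        batch_write(all_series, deadline)
        return "SUCCESS"
    except Exception:
        # warning, not error: the periodic reader retries on the next cycle,
        # so transient failures should not be treated as fatal noise
        logger.warning("Failed to export metrics batch", exc_info=True)
        return "FAILURE"
```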

write_ind = 0
while write_ind < len(series):
# find time left for next batch
timeout = deadline - time.time() if deadline else gapic_v1.method.DEFAULT
Contributor

medium

If the deadline has already passed, the calculated timeout will be negative. It is safer to ensure the timeout is at least zero before passing it to the GAPIC client.

Suggested change
timeout = deadline - time.time() if deadline else gapic_v1.method.DEFAULT
timeout = max(0, deadline - time.time()) if deadline else gapic_v1.method.DEFAULT

# fixed labels sent with each metric update
self.shared_labels = {
"client_name": f"python-bigtable/{client_version}",
"client_uid": client_uid or self._generate_client_uid(),
Contributor

medium

The client_uid is currently generated per handler instance. Since handlers are created per table, this results in different UIDs for the same client process across different tables. The client_uid should be stable for the lifetime of the BigtableDataClient to allow for consistent aggregation of metrics from a single client instance in the backend. Consider generating the UID once in the client and passing it to the handlers.
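One way to get a stable identifier, sketched below: generate the UID once in the client constructor and pass it into every handler, so all tables report the same value. The `_generate_client_uid` format (uuid@pid@hostname) and class shapes are assumptions for illustration, not the PR's exact code:

```python
# Sketch: client_uid generated once per client, shared by all handlers.
# The uid format and helper/class names are assumed for illustration.
import os
import socket
import uuid


def _generate_client_uid():
    # uuid + pid + hostname: unique per process, stable for its lifetime
    return f"{uuid.uuid4()}@{os.getpid()}@{socket.gethostname()}"


class MetricsHandler:
    def __init__(self, client_uid):
        # fixed labels sent with each metric update
        self.shared_labels = {"client_uid": client_uid}


class BigtableDataClient:
    def __init__(self):
        self._client_uid = _generate_client_uid()

    def make_handler(self):
        # every table's handler receives the same uid
        return MetricsHandler(client_uid=self._client_uid)


client = BigtableDataClient()
uid_a = client.make_handler().shared_labels["client_uid"]
uid_b = client.make_handler().shared_labels["client_uid"]
assert uid_a == uid_b  # consistent aggregation key across tables
```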
