[DRAFT] feat: client side metrics handlers #16760
daniel-sanche wants to merge 6 commits into bigtable_csm_2_instrumentation_advanced
Conversation
Code Review
This pull request implements client-side metrics for the Bigtable library using OpenTelemetry, including a custom exporter for Google Cloud Monitoring. The review feedback focuses on improving resource efficiency by moving the MeterProvider and client_uid generation to the client level to avoid thread leaks and inconsistent identifiers across tables. Additionally, recommendations were made to handle potential KeyError exceptions in the exporter, improve logging for background export failures, and ensure non-negative timeouts during batch writes.
```python
gcp_reader = PeriodicExportingMetricReader(
    exporter, export_interval_millis=export_interval * 1000
)
# use private meter provider to store instruments and views
self.meter_provider = MeterProvider(
    metric_readers=[gcp_reader], views=VIEW_LIST
)
```
Creating a new PeriodicExportingMetricReader and MeterProvider for every GoogleCloudMetricsHandler is highly inefficient. Since a handler is created for every Table instance, and each PeriodicExportingMetricReader starts its own background thread, this will lead to a significant thread leak and excessive resource consumption (e.g., if a user accesses hundreds of tables).
The MeterProvider and its associated reader should be initialized once at the BigtableDataClient level and shared across all table handlers. This also ensures that metrics are properly flushed and threads are shut down when the client is closed.
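To make the suggested ownership concrete, here is a minimal sketch of the pattern: the client owns one provider and hands it to every table handler, and `close()` shuts it down exactly once. The class names mirror the PR, but `_SharedMeterProvider` is a stub standing in for the real OpenTelemetry `MeterProvider` plus reader, so this is illustrative rather than the PR's implementation.

```python
class _SharedMeterProvider:
    """Stub standing in for an OpenTelemetry MeterProvider + its reader thread."""

    def __init__(self):
        self.shutdown_called = False

    def shutdown(self):
        self.shutdown_called = True


class TableMetricsHandler:
    def __init__(self, table_name, meter_provider):
        self.table_name = table_name
        # Reuses the client's provider instead of creating its own.
        self.meter_provider = meter_provider


class BigtableDataClient:
    def __init__(self):
        # One provider (and one background exporter thread) per client,
        # not one per table.
        self._meter_provider = _SharedMeterProvider()

    def get_table(self, name):
        # Every table handler shares the client's provider.
        return TableMetricsHandler(name, self._meter_provider)

    def close(self):
        # Flush metrics and stop the exporter thread exactly once.
        self._meter_provider.shutdown()
```

With this shape, opening hundreds of tables creates zero additional exporter threads, and closing the client is the single shutdown point.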
| "instance": data_point.attributes[ | ||
| "resource_instance" | ||
| ], | ||
| "cluster": data_point.attributes[ | ||
| "resource_cluster" | ||
| ], | ||
| "table": data_point.attributes["resource_table"], | ||
| "zone": data_point.attributes["resource_zone"], |
Accessing attributes directly via keys will raise a KeyError if any of the expected resource labels are missing from the data point. This is a known issue mentioned in the PR description. Using .get() or validating the presence of these keys is necessary for a robust exporter.
"instance": data_point.attributes.get("resource_instance", ""),
"cluster": data_point.attributes.get("resource_cluster", ""),
"table": data_point.attributes.get("resource_table", ""),
"zone": data_point.attributes.get("resource_zone", ""),| for data_point in [ | ||
| pt for pt in metric.data.data_points if pt.attributes | ||
| ]: | ||
| if data_point.attributes: |
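An alternative to defaulting missing labels to `""` is to skip data points that lack any required label, so partial time series are never written. A sketch of that idea, where `REQUIRED_LABELS` and `extract_resource_labels` are hypothetical helper names, not identifiers from the PR:

```python
REQUIRED_LABELS = (
    "resource_instance",
    "resource_cluster",
    "resource_table",
    "resource_zone",
)


def extract_resource_labels(attributes):
    """Return the monitored-resource labels, or None if any are missing.

    Returning None lets the exporter skip the data point instead of
    raising KeyError (or exporting empty labels) mid-export.
    """
    if not all(key in attributes for key in REQUIRED_LABELS):
        return None
    return {
        "instance": attributes["resource_instance"],
        "cluster": attributes["resource_cluster"],
        "table": attributes["resource_table"],
        "zone": attributes["resource_zone"],
    }
```

Whether to skip or default is a product decision: skipping keeps the backend clean, while defaulting preserves partially labeled points for debugging.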
```python
try:
    self._batch_write(all_series, deadline)
    return MetricExportResult.SUCCESS
except Exception:
    return MetricExportResult.FAILURE
```
Catching all exceptions and returning FAILURE without logging makes it very difficult to diagnose issues when metrics fail to export. Since this code runs in a background thread, these errors will be silent. Following repository guidelines for background tasks and broad exception blocks, log the exception as a warning (e.g., using logger.warning) to aid in debugging without creating excessive noise from transient failures.
References
- For exceptions in background tasks that are designed to be retried, log them as warnings rather than errors to reduce noise from transient, recoverable failures.
- Avoid broad `except Exception:` blocks that silently return `None`. Instead, log the exception (e.g., using `logger.warning`) to aid in debugging and prevent masking underlying issues.
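Applying those guidelines to the export path might look like the following sketch. `export_with_logging` is a hypothetical standalone wrapper (the PR's version is a method using `MetricExportResult`); the point is the `logger.warning(..., exc_info=True)` call that surfaces the traceback from the background thread.

```python
import logging

logger = logging.getLogger(__name__)


def export_with_logging(batch_write, all_series, deadline):
    """Run a batch write, logging failures instead of swallowing them."""
    try:
        batch_write(all_series, deadline)
        return "SUCCESS"
    except Exception:
        # Background-thread failures are otherwise invisible; a warning
        # (not an error) keeps noise low for transient, retryable issues.
        logger.warning("failed to export metrics batch", exc_info=True)
        return "FAILURE"
```

`exc_info=True` attaches the full traceback to the log record, which is usually the only diagnostic available for errors raised off the main thread.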
```python
write_ind = 0
while write_ind < len(series):
    # find time left for next batch
    timeout = deadline - time.time() if deadline else gapic_v1.method.DEFAULT
```
If the deadline has already passed, the calculated timeout will be negative. It is safer to ensure the timeout is at least zero before passing it to the GAPIC client.
```diff
- timeout = deadline - time.time() if deadline else gapic_v1.method.DEFAULT
+ timeout = max(0, deadline - time.time()) if deadline else gapic_v1.method.DEFAULT
```
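The clamping logic can be isolated into a small helper, which also makes it easy to unit-test against a fixed clock. This is a sketch, not the PR's code: `remaining_timeout`, the injectable `now` parameter, and the `DEFAULT` sentinel (standing in for `gapic_v1.method.DEFAULT`) are all illustrative.

```python
import time

DEFAULT = object()  # stand-in for gapic_v1.method.DEFAULT


def remaining_timeout(deadline, now=None):
    """Time left before the deadline, clamped so it is never negative.

    Returns the DEFAULT sentinel when no deadline is set, mirroring
    the original `if deadline else DEFAULT` branch.
    """
    if not deadline:
        return DEFAULT
    now = time.time() if now is None else now
    # A deadline already in the past yields 0, not a negative timeout
    # that the GAPIC client could reject.
    return max(0, deadline - now)
```

Passing `now` explicitly in tests avoids flakiness from the real clock.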
```python
# fixed labels sent with each metric update
self.shared_labels = {
    "client_name": f"python-bigtable/{client_version}",
    "client_uid": client_uid or self._generate_client_uid(),
```
The client_uid is currently generated per handler instance. Since handlers are created per table, this results in different UIDs for the same client process across different tables. The client_uid should be stable for the lifetime of the BigtableDataClient to allow for consistent aggregation of metrics from a single client instance in the backend. Consider generating the UID once in the client and passing it to the handlers.
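A minimal sketch of the suggested ownership, generating the UID once on the client and threading it through to handlers. The UID format shown (uuid, pid, hostname) is an illustrative convention, not necessarily what `_generate_client_uid` produces in the PR, and `make_handler_labels` is a hypothetical helper.

```python
import os
import socket
import uuid


def generate_client_uid():
    """Build a process-stable identifier for this client instance."""
    # Format is illustrative: a random uuid plus pid and hostname.
    return f"python-{uuid.uuid4()}@{os.getpid()}@{socket.gethostname()}"


class BigtableDataClient:
    def __init__(self):
        # Generated exactly once per client; every table handler reuses it,
        # so the backend can aggregate metrics from one client instance.
        self._client_uid = generate_client_uid()

    def make_handler_labels(self):
        return {"client_uid": self._client_uid}
```

Two tables on the same client now report identical `client_uid` values, while two separate client instances remain distinguishable.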
Migrate googleapis/python-bigtable#1189 to the monorepo
This PR builds off of googleapis/python-bigtable#1187 to add handlers to the client-side metrics system, which can subscribe to the metrics stream, and export the results into different collection systems
We add two handlers to the system:
- `GoogleCloudMetricsHandler`: sends metrics to a private OpenTelemetry meter, and then periodically exports them to GCP. Built on top of `OpenTelemetryMetricsHandler`
- `OpenTelemetryMetricsHandler`: sends metrics to the root MeterProvider, so the user can access the exported metrics for their own systems. This will be off by default, but can be added alongside `GoogleCloudMetricsHandler` if needed

TODO: