improve device farm parallelism and reliability #72429

Draft
davidsbailey wants to merge 9 commits into staging from expand-mobile-device-pool

@davidsbailey davidsbailey commented Apr 30, 2026

In preparation for moving real DTT traffic from saucelabs to device farm, this PR increases parallelism and makes some reliability and usability improvements:

  • increase mobile parallelism from 20 to 40
  • expand the pools of mobile devices (iPad and iPhone) to mitigate availability issues with higher concurrency
  • display mobile device info in cucumber logs:
    • (screenshot: Screenshot 2026-04-30 at 10 36 06 PM)
  • fix segfault issues when calling runner.rb --parallel ... on mac
  • pass --device-farm and --db flags through to Copy Rerun Cmd button
  • use exponential backoff to mitigate throttling errors when requesting desktop browser sessions
  • temporarily disambiguate device farm output filenames from saucelabs ones via _df_output.html suffix
  • replace --magic_retry with --retry_count 2
  • finalize or revert wip changes to print debug output --> gate on chat client secrets?
  • actually, stop passing --device-farm to Copy Rerun Cmd button for now

Testing story

To validate, I ran the device farm rake tasks locally, with temporary modifications so that we (1) hit test-studio.code.org and (2) skip tests that require @dashboard_db_access.

Device Farm Desktop

Dave-M4:~/src/cdo$ bundle exec rake test:devicefarm_desktop_ui
...
dsb@development:/Users/dsb/src/cdo$ cd /Users/dsb/src/cdo/dashboard/test/ui && bundle exec ./runner.rb --device-farm -c Chrome,Firefox --parallel 50 --retry_count 2 --with-status-page --fail_fast
...
369 passed. 0 failed. Test count: 369. Duration: 26:14 minutes. Total successful reruns of flaky tests: 25.

Of the 25 successful reruns, 19 were due to Aws::DeviceFarm::Errors::ThrottlingException errors that passed on rerun.

Device Farm Mobile

Dave-M4:~/src/cdo$ bundle exec rake test:devicefarm_mobile_ui
...
dsb@development:/Users/dsb/src/cdo$ cd /Users/dsb/src/cdo/dashboard/test/ui && bundle exec ./runner.rb --device-farm -c iPhone,iPad --parallel 40 --retry_count 2 --with-status-page --fail_fast
...
173 passed. 0 failed. Test count: 173. Duration: 35:40 minutes. Total successful reruns of flaky tests: 29.

Deployment notes

  • run rake tasks (unmodified) on test machine to validate test suite performance / pass rate when @dashboard_db_access tests are included.

davidsbailey and others added 9 commits April 29, 2026 16:52
Set OS_ACTIVITY_MODE=disable and OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
on Darwin at the top of runner.rb, before any require can initialize
libsystem_trace or the ObjC runtime. Without this, Parallel.map's
in_processes workers crash on macOS in libsystem_trace
_os_log_preferences_refresh during their first AWS S3 log upload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
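
A minimal sketch of the Darwin guard this commit describes, assuming it runs before any other `require` in runner.rb. The helper name `apply_darwin_fork_workarounds` and its injectable `env`/`host_os` parameters are hypothetical, added here only to make the logic testable; the actual change may simply assign to `ENV` inline.

```ruby
require 'rbconfig'

# Hypothetical helper: sets the two environment variables named in the commit
# message, but only on macOS (Darwin). Returns true when they were applied.
def apply_darwin_fork_workarounds(env = ENV, host_os = RbConfig::CONFIG['host_os'])
  return false unless host_os.match?(/darwin/i)

  # Both variables and values are taken from the commit message above.
  env['OS_ACTIVITY_MODE'] = 'disable'
  env['OBJC_DISABLE_INITIALIZE_FORK_SAFETY'] = 'YES'
  true
end

# Intended to run first, so that forked Parallel workers inherit the settings.
apply_darwin_fork_workarounds
```

On non-Darwin hosts the helper is a no-op, so Linux test machines are unaffected.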
create_mobile_session now also returns the picked Aws::DeviceFarm::Types::Device,
so connect.rb can identify the provisioned device alongside the existing
visual-log line in the cucumber HTML output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
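
For illustration, the device line in the cucumber log could be built roughly as follows. This is a sketch, not the code in connect.rb: `FakeDevice` is a stand-in limited to a few fields that `Aws::DeviceFarm::Types::Device` exposes (`name`, `platform`, `os`, `form_factor`), `device_log_line` is a hypothetical helper, and the sample values are made up.

```ruby
# Stand-in for Aws::DeviceFarm::Types::Device, restricted to the fields used here.
FakeDevice = Struct.new(:name, :platform, :os, :form_factor, keyword_init: true)

# Hypothetical formatter for the line shown alongside the visual-log link.
def device_log_line(device)
  "Device Farm device: #{device.name} (#{device.platform} #{device.os}, #{device.form_factor})"
end

# Illustrative values only; a real run would use the device returned
# by create_mobile_session.
device = FakeDevice.new(name: 'Apple iPhone 14', platform: 'IOS', os: '16.0', form_factor: 'PHONE')
puts device_log_line(device)
```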
Mirror the runner's $options.device_farm and $options.force_db_access
flags into the haml render context, and emit them in the rerun command
only when set. Also switch the rerun-command construction to a flag
array so non-Eyes builds no longer have a double-space artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
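
The flag-array approach can be sketched in plain Ruby as below. The real construction lives in a haml template, so this is only an approximation: `rerun_command` and its keyword arguments are hypothetical, and the `--eyes` flag name is an assumption (`--device-farm` and `--db` come from the PR description).

```ruby
# Hypothetical helper: appends optional flags to a base rerun command.
# Building an array and joining it avoids the double space that string
# concatenation leaves behind when an optional flag is absent.
def rerun_command(base, eyes: false, device_farm: false, force_db_access: false)
  flags = [
    ('--eyes' if eyes),               # flag name assumed for illustration
    ('--device-farm' if device_farm),
    ('--db' if force_db_access)
  ].compact                           # drop the nils left by unset options

  [base, *flags].join(' ')
end

puts rerun_command('bundle exec ./runner.rb -c Chrome', device_farm: true, force_db_access: true)
```

With no options set, the result is the base command unchanged, with no trailing or doubled whitespace.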
Switch the Aws::DeviceFarm::Client to retry_mode: 'adaptive' with
retry_limit: 10 to absorb ThrottlingException bursts when many forked
workers call create_test_grid_url near-simultaneously under high
--parallel desktop runs. The token bucket is per-process post-fork, so
this is per-worker absorption rather than cross-worker coordination.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
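
To make the per-worker backoff concrete, here is a hand-rolled sketch of capped full-jitter exponential backoff. It is not the SDK's implementation and omits the client-side rate limiting that adaptive mode adds; `ThrottlingError` stands in for `Aws::DeviceFarm::Errors::ThrottlingException`, and the delay ceiling of min(cap, 2^n) seconds on the nth retry is an assumed model.

```ruby
# Stand-in for Aws::DeviceFarm::Errors::ThrottlingException.
class ThrottlingError < StandardError; end

# Retries the block on ThrottlingError up to max_retries times, sleeping a
# random fraction of min(cap_seconds, 2**n) before the nth retry (full jitter).
# sleeper and rng are injectable so the behavior can be tested without waiting.
def with_full_jitter_backoff(max_retries:, cap_seconds: 20,
                             sleeper: ->(s) { sleep(s) }, rng: Random.new)
  retries = 0
  begin
    yield
  rescue ThrottlingError
    raise if retries >= max_retries  # budget exhausted: surface the error

    retries += 1
    sleeper.call(rng.rand * [cap_seconds, 2**retries].min)
    retry
  end
end
```

Because each forked worker keeps its own retry state, this absorbs throttling per process and does not coordinate across workers, as the commit message notes. Separately, the AWS SDK for Ruby documentation describes `max_attempts` as the setting used in `standard` and `adaptive` retry modes and `retry_limit` as applying to `legacy` mode only, so it may be worth confirming which option takes effect here.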
During the SauceLabs-to-Device-Farm migration we run both providers
concurrently on the chef-managed test machine, where they collide on
shared per-feature output filenames (Chrome_<feature>_output.html, the
matching .rerun file, and the same-key S3 uploads). Append _df to the
test_run_identifier when --device-farm is set, mirroring the existing
_eyes pattern, and teach test_status.js to read a new haml-emitted
hidden input so each provider's status page only matches its own keys.

To remove once we settle on a single provider per browser, the change
is the inverse of these three edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
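
The naming scheme can be sketched as follows. Both helper signatures are hypothetical, and the ordering of the `_eyes` and `_df` suffixes is an assumption; the `<browser>_<feature>_output.html` shape and the `_df` suffix come from the commit message and PR description.

```ruby
# Hypothetical reconstruction of the identifier logic: one suffix per mode,
# so SauceLabs and Device Farm runs of the same feature get distinct keys.
def test_run_identifier(browser, feature, eyes: false, device_farm: false)
  id = "#{browser}_#{feature}"
  id += '_eyes' if eyes          # existing pattern mentioned in the commit
  id += '_df' if device_farm     # new suffix added by this commit
  id
end

# Output file name derived from the identifier.
def output_filename(identifier)
  "#{identifier}_output.html"
end

puts output_filename(test_run_identifier('Chrome', 'some_feature'))
puts output_filename(test_run_identifier('Chrome', 'some_feature', device_farm: true))
```

Since the .rerun file and the S3 key derive from the same identifier, one suffix is enough to keep all three artifacts from colliding.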
A run with --parallel 50 still hit Aws::DeviceFarm::Errors::ThrottlingException
on CreateTestGridUrl ~78s into the suite, where the existing 10-retry
budget (worst-case ~150s of full-jitter backoff capped at 20s/attempt)
isn't enough to ride out a multi-minute sustained throttle. 20 retries
extends the worst-case window to ~350s without otherwise changing the
retry policy. A separate AWS support ticket has been filed to raise the
per-account TPS limit on CreateTestGridUrl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
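
The worst-case windows quoted above can be checked with a small calculation. It assumes the nth retry waits at most min(20, 2^n) seconds for n = 1..N and ignores request latency; `worst_case_backoff_seconds` is a hypothetical helper, not part of the SDK.

```ruby
# Upper bound on total backoff sleep: each retry n waits at most
# min(cap_seconds, 2**n) seconds when jitter draws its maximum.
def worst_case_backoff_seconds(retries, cap_seconds: 20)
  (1..retries).sum { |n| [cap_seconds, 2**n].min }
end

puts worst_case_backoff_seconds(10) # => 150
puts worst_case_backoff_seconds(20) # => 350
```

Under this model the first four retries contribute 2 + 4 + 8 + 16 = 30s and every later retry contributes the 20s cap, so each additional retry extends the window by 20s.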
@davidsbailey changed the title from "more prep before moving to device farm" to "improve device farm parallelism and reliability" on May 1, 2026