Fix sequential bottleneck in parallel parsing#21291
Conversation
Source files are now usually read only in the Rust extension. This improves parallel scaling, since `get_source()` was a sequential bottleneck. I measured ~5% improvement to parallel type checking times in some cases on macOS (though it was a bit noisy).
ilevkivskyi
left a comment
I think I found a logical flaw that may limit the performance improvement from this. As mentioned in the parse_all() docstring, it is a good idea to keep it in roughly 1:1 correspondence with parse_file(); see the review comments for details.
```python
state.needs_parse = False
# New parser reads source from file directly, we do this only for
# the side effect of parsing inline mypy configurations.
state.get_source()
```
I think you need to (conditionally) remove the same call in State.parse_file(), otherwise the worker will call it when loading the tree (look for state.parse_file(raw_data=raw_data) in worker.py).
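The review point above can be sketched as follows. This is a hypothetical, self-contained model, not mypy's actual API: the names `State`, `use_rust_parser`, and `source_read` are illustrative stand-ins. The idea is that if parse_all() skips the Python-side `get_source()` when the Rust parser reads the file, then parse_file() (which the worker calls) must skip it under the same condition:

```python
# Hypothetical sketch of keeping parse_all() and parse_file() in sync.
class State:
    def __init__(self, source: str, use_rust_parser: bool) -> None:
        self.source = source
        self.use_rust_parser = use_rust_parser
        self.needs_parse = True
        self.source_read = False  # tracks whether Python read the file

    def get_source(self) -> str:
        # This was the sequential bottleneck when called from the main loop.
        self.source_read = True
        return self.source

    def parse_file(self) -> None:
        # Mirror of parse_all(): only read the source in Python when the
        # Rust parser is not handling it, so the worker skips the call too.
        if not self.use_rust_parser:
            self.get_source()
        self.needs_parse = False


s = State("x = 1", use_rust_parser=True)
s.parse_file()
print(s.source_read)  # False: the Python-side read was skipped
```

With the guard in only one of the two methods, the worker would still pay for the redundant read when loading the tree.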
```python
self.errors.set_file(state.xpath, state.id, state.options)
for lineno, error in config_errors:
    self.error(lineno, error)
state.check_for_invalid_options()
```
This method is getting really long; maybe it is possible to factor out the tree loading logic into a separate method? (Especially in view of the comment above, since you will probably need this in parse_file() as well.)
Refactored the parallel part into a separate method.
```diff
  self.log(f"Using cached AST for {state.xpath} ({state.id})")
- state.tree, state.early_errors = self.ast_cache[state.id]
+ state.tree, state.early_errors, source_hash = self.ast_cache[state.id]
+ if state.source_hash is None:
```
Why is the `is None` check needed here?
```diff
  manager.log(f"Using cached AST for {self.xpath} ({self.id})")
- self.tree, self.early_errors = manager.ast_cache[self.id]
+ self.tree, self.early_errors, source_hash = manager.ast_cache[self.id]
+ if self.source_hash is None:
```
Same here; not sure why the `is None` check is needed.
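A minimal, self-contained sketch of the pattern being questioned (the real attribute layout lives in mypy's build state; `load_from_cache` and the cache shape here are illustrative). One plausible reading of the guard is that it restores the hash from the AST cache only when none was computed for the current run, so a fresh hash is never clobbered by a cached one:

```python
# Hypothetical AST cache entry: (tree, early_errors, source_hash).
cached = {"m": ("tree", [], "abc123")}

class State:
    def __init__(self, source_hash=None):
        self.tree = None
        self.early_errors = None
        self.source_hash = source_hash

    def load_from_cache(self, cache, key):
        self.tree, self.early_errors, source_hash = cache[key]
        # The "is None" guard under discussion: only fill in the hash
        # if this run has not already computed one.
        if self.source_hash is None:
            self.source_hash = source_hash


s = State(source_hash="fresh")
s.load_from_cache(cached, "m")
print(s.source_hash)  # "fresh": the cached hash does not overwrite it
```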
```python
self.is_partial_stub_package = is_partial_stub_package
self.uses_template_strings = uses_template_strings
self.source_hash = source_hash
self.mypy_comments = mypy_comments if mypy_comments is not None else []
```
I think these two (or at least the second one) need to be sent to the worker, i.e. you will need to handle them in write() and read(). The worker needs to know the full options, since we don't send options over the socket for each module (it is a big object). I guess tests pass now because the worker still calls get_source().
Added serialization back (I had removed it since I thought it wasn't needed).
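A toy sketch of what "handling them in write() and read()" amounts to. The function names and the JSON wire format are stand-ins for illustration; mypy's actual worker protocol differs. The point is that both fields round-trip to the worker, with the `mypy_comments` default mirroring the constructor's `None`-to-empty-list handling:

```python
import json

def write(state: dict) -> str:
    # Serialize the two new fields alongside whatever else is sent.
    return json.dumps({
        "source_hash": state["source_hash"],
        "mypy_comments": state["mypy_comments"],
    })

def read(data: str) -> dict:
    obj = json.loads(data)
    # Default mirrors the constructor: a missing/None value becomes [].
    obj["mypy_comments"] = obj.get("mypy_comments") or []
    return obj


roundtrip = read(write({"source_hash": "abc", "mypy_comments": None}))
print(roundtrip)  # {'source_hash': 'abc', 'mypy_comments': []}
```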
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅
ilevkivskyi
left a comment
LG, thanks! We will be able to simplify this when we have just one parser.
Previously we always read the file, processed inline comments, and calculated the SHA-1 for each parsed file sequentially in Python. Now these steps mostly happen in the Rust extension, which allows better parallel scaling.
I measured ~5% improvement to parallel type checking times in some cases on macOS (though it was a bit noisy, and used an earlier version of this PR).
Related to #21215.
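The scaling benefit can be framed with Amdahl's law: shrinking the sequential section (the Python-side file read and hashing) raises the ceiling on parallel speedup. The fractions below are made up for illustration; the ~5% figure above is a measurement, not a derivation:

```python
def speedup(serial_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup with a given serial fraction and worker count."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# Illustrative only: with 8 workers, cutting the serial fraction
# from 10% to 5% of total work improves the achievable speedup.
before = speedup(0.10, 8)
after = speedup(0.05, 8)
print(round(before, 2), round(after, 2))  # 4.71 5.93
```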