Fix sequential bottleneck in parallel parsing #21291

Merged
JukkaL merged 11 commits into master from more-parallel-parse on Apr 22, 2026

Conversation

@JukkaL
Collaborator

@JukkaL JukkaL commented Apr 22, 2026

Previously we always read the file, processed inline comments, and calculated sha1 for each parsed file sequentially in Python. Now these are mostly moved to the Rust extension, which allows better parallel scaling.

I measured ~5% improvement to parallel type checking times in some cases on macOS (though it was a bit noisy, and used an earlier version of this PR).

Related to #21215.
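
For illustration, the per-file work described above (read the bytes, compute a SHA-1, collect inline configuration comments) might look roughly like the sketch below when done one file at a time in Python. The names `preprocess_source`, `preprocess_all`, and `PREFIX` are hypothetical and this is not mypy's actual code; it only shows why a plain Python loop over files cannot scale with the number of workers.

```python
from __future__ import annotations

import hashlib

# Assumed marker for inline configuration comments (illustrative only).
PREFIX = "# mypy: "


def preprocess_source(data: bytes) -> tuple[str, list[tuple[int, str]]]:
    """Return (sha1 hex digest, [(line number, inline config text), ...])."""
    digest = hashlib.sha1(data).hexdigest()
    text = data.decode("utf-8", errors="replace")
    comments = [
        (lineno, line[len(PREFIX):].strip())
        for lineno, line in enumerate(text.splitlines(), start=1)
        if line.startswith(PREFIX)
    ]
    return digest, comments


def preprocess_all(
    sources: dict[str, bytes],
) -> dict[str, tuple[str, list[tuple[int, str]]]]:
    # A plain loop: each file is hashed and scanned one after another.
    return {path: preprocess_source(data) for path, data in sources.items()}


src = b"# mypy: disallow-untyped-defs\nx = 1\n"
result = preprocess_all({"a.py": src})
```

Moving this kind of work into the extension that already reads and parses each file means it no longer has to run in that single sequential loop.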

JukkaL added 5 commits April 22, 2026 14:01
The file is now usually only read in the Rust extension. This improves
parallel scaling, as `get_source()` was a sequential bottleneck. I
measured ~5% improvement to parallel type checking times in some cases
on macOS (though it was a bit noisy).
@JukkaL JukkaL requested a review from ilevkivskyi April 22, 2026 14:08
Member

@ilevkivskyi ilevkivskyi left a comment

I think I found a logical flaw that may limit the performance improvement from this. As mentioned in the parse_all() docstring, it is a good idea to keep it in roughly 1:1 correspondence with parse_file(); see the review comments for details.

Comment thread mypy/build.py
state.needs_parse = False
# New parser reads source from file directly, we do this only for
# the side effect of parsing inline mypy configurations.
state.get_source()
Member

I think you need to (conditionally) remove the same call in State.parse_file(), otherwise the worker will call it when loading the tree (look for state.parse_file(raw_data=raw_data) in worker.py).

Collaborator Author

Updated.
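
A minimal sketch of the idea discussed in this thread, assuming the read is skipped whenever an already-serialized tree is passed in. The `State` class below is a hypothetical stand-in, and keying the condition on `raw_data` is an assumption for illustration, not necessarily how the PR implements it.

```python
from __future__ import annotations


class State:
    """Hypothetical stand-in for a build state; not mypy's actual class."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.source_reads = 0  # counts how often the source was read in Python

    def get_source(self) -> str:
        # Stand-in for reading the file and processing inline configuration.
        self.source_reads += 1
        return ""

    def parse_file(self, raw_data: bytes | None = None) -> None:
        # Assumption: when a serialized tree is supplied, the source was
        # already handled elsewhere, so the Python-side read is skipped.
        if raw_data is None:
            self.get_source()
        # ... build or load the tree here ...


coordinator = State("a.py")
coordinator.parse_file()  # no serialized tree: reads the source

worker = State("a.py")
worker.parse_file(raw_data=b"serialized-tree")  # tree supplied: no read
```

With this shape, a worker that loads a tree from `raw_data` never goes through the sequential read path.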

Comment thread mypy/build.py Outdated
self.errors.set_file(state.xpath, state.id, state.options)
for lineno, error in config_errors:
self.error(lineno, error)
state.check_for_invalid_options()
Member

This method is getting really long; maybe it is possible to factor out the tree loading logic into a separate method? (Especially in view of the comment above, since you will probably need this in parse_file() as well.)

Collaborator Author

Refactored the parallel part into a separate method.

Comment thread mypy/build.py Outdated
self.log(f"Using cached AST for {state.xpath} ({state.id})")
state.tree, state.early_errors = self.ast_cache[state.id]
state.tree, state.early_errors, source_hash = self.ast_cache[state.id]
if state.source_hash is None:
Member

Why is the `is None` check needed here?

Collaborator Author

Removed the check.

Comment thread mypy/build.py Outdated
manager.log(f"Using cached AST for {self.xpath} ({self.id})")
self.tree, self.early_errors = manager.ast_cache[self.id]
self.tree, self.early_errors, source_hash = manager.ast_cache[self.id]
if self.source_hash is None:
Member

Same here, not sure why the `is None` check is needed.

Collaborator Author

Removed.

Comment thread mypy/nodes.py
self.is_partial_stub_package = is_partial_stub_package
self.uses_template_strings = uses_template_strings
self.source_hash = source_hash
self.mypy_comments = mypy_comments if mypy_comments is not None else []
Member

I think these two (or at least the second one) need to be sent to the worker, i.e. you will need to handle them in write() and read(). The worker needs to know the full options, since we don't send options over the socket for each module (it is a big object). I guess tests pass now, because the worker still calls get_source().

Collaborator Author

Added serialization back (I had removed it since I thought it wasn't needed).
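
To illustrate the point of this thread: if the worker only receives the serialized tree, any per-file data it needs (here `source_hash` and `mypy_comments`) has to survive a `write()`/`read()` round trip. The sketch below uses a hypothetical `TreeSummary` class and JSON purely for illustration; mypy's real serialization format and node classes are different.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field


@dataclass
class TreeSummary:
    """Hypothetical container for the two fields discussed above."""

    source_hash: str | None = None
    mypy_comments: list[tuple[int, str]] = field(default_factory=list)

    def write(self) -> bytes:
        # Serialize both fields so the receiving side does not need the file.
        payload = {
            "source_hash": self.source_hash,
            "mypy_comments": self.mypy_comments,
        }
        return json.dumps(payload).encode("utf-8")

    @classmethod
    def read(cls, data: bytes) -> TreeSummary:
        raw = json.loads(data.decode("utf-8"))
        # JSON turns tuples into lists, so rebuild the (line, text) pairs.
        comments = [(int(line), str(text)) for line, text in raw["mypy_comments"]]
        return cls(source_hash=raw["source_hash"], mypy_comments=comments)


sent = TreeSummary(source_hash="abc123", mypy_comments=[(1, "strict")])
received = TreeSummary.read(sent.write())
```

If either field were dropped in `write()`, the receiving side would have to reread the source to recover it, which is the sequential path this PR is trying to avoid.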

@github-actions
Contributor

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

Member

@ilevkivskyi ilevkivskyi left a comment

LG, thanks! We will be able to simplify this once we have just one parser.

@JukkaL JukkaL merged commit 781f1e6 into master Apr 22, 2026
24 checks passed
@JukkaL JukkaL deleted the more-parallel-parse branch April 22, 2026 17:01