Add composefs-ostree and some basic CLI tools #144
allisonkarlitskaya
left a comment
I love this! Thanks for working on it!
I made some comments on the first round of commits. Feel free to adjust those and PR them separately: we can merge those now without further discussion.
The blobs thing is going to need a call.
I didn't review the crate addition in any detail at all. That's probably also going to need a call :)
Hmmm, thinking more about this. We probably want a "content type" magic thing in the splitstream header as well, so we can error out if the wrapped thing is of the wrong type.
Ok. Reworked this to use splitstreams for object maps and commits. By using an object mapping to find the object map, the content of the commit's splitstream is just the commit data, and thus the sha256 of that splitstream matches the ostree commit id.
@allisonkarlitskaya There is still lots to do here. But have a look at this approach and see what you think.
Added some further changes. We now validate all objects when pulling and all non-file objects when creating images. It's hard to efficiently validate file objects during create-image though, since we would like to avoid re-reading the external object files to compute the sha256. Remaining things to do:
I started working on the delta support, but it failed because of an issue in gvariant-rs.
allisonkarlitskaya
left a comment
It occurs to me that it might be interesting not to sort the table of fs-verity references, and it might also be interesting to permit duplicate items.
On the topic of deferring writing of objects to a background thread, this would allow us to write "external object #123" based on a sequential index to the splitstream without actually knowing the hash value yet, and then fill in the actual values in the header at the end when we're writing: it helps there that the fs-verity references aren't compressed and therefore not part of the stream...
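A minimal sketch of the "reserve an index now, fill in the digest later" idea from the comment above. `DeferredRefs`, `reserve`, `resolve` and `finish` are hypothetical names, not part of composefs-rs, and a 32-byte digest is assumed purely for illustration. The table is kept in insertion order, unsorted, and duplicates are allowed, matching the two suggestions above.

```rust
/// Assumed digest width for this sketch only.
type Digest = [u8; 32];

/// Hypothetical reference table whose slots are handed out before the
/// digests are known (e.g. while a background thread is still hashing).
#[derive(Default)]
struct DeferredRefs {
    // Slot i holds the digest of "external object #i" once it is known.
    slots: Vec<Option<Digest>>,
}

impl DeferredRefs {
    /// Reserve the next sequential index; the stream can refer to it right away.
    fn reserve(&mut self) -> usize {
        self.slots.push(None);
        self.slots.len() - 1
    }

    /// Record the digest for a previously reserved index.
    /// Panics if `index` was never reserved.
    fn resolve(&mut self, index: usize, digest: Digest) {
        self.slots[index] = Some(digest);
    }

    /// Produce the final table for the header, or the first index that
    /// was reserved but never resolved.
    fn finish(self) -> Result<Vec<Digest>, usize> {
        self.slots
            .into_iter()
            .enumerate()
            .map(|(i, slot)| slot.ok_or(i))
            .collect()
    }
}
```

Because the references live in the uncompressed header rather than in the stream, the header can be written last, once `finish` succeeds.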
It seems like we should get in the splitstream changes in 0f6d69e at least sooner rather than later? Can you file a separate PR?
This changes the splitstream format a bit, with the goal of allowing splitstreams to support ostree files as well (see composefs#144)

The primary differences are:
* The header is not compressed
* All referenced fs-verity objects are stored in the header, including external chunks, mapped splitstreams and (a new feature) references that are not used in chunks.
* The mapping table is separate from the reference table (and generally smaller), and indexes into it.
* There is a magic value to detect the file format.
* There is a magic content type to detect the type wrapped in the stream.
* We store a tag for what ObjectID format is used
* The total size of the stream is stored in the header.

The ability to reference file objects in the repo even if they are not part of the splitstream "content" will be useful for the ostree support to reference file content objects.

This change also allows more efficient GC enumeration, because we don't have to parse the entire splitstream to find the referenced objects.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
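To make the list above concrete, here is a rough in-memory sketch of a header with those properties. All names, field widths and the ordering are assumptions for illustration; this is not the actual on-disk layout from the commit, and the file-format magic is left out.

```rust
/// Tag for which ObjectID format the references use (assumed variants).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ObjectIdFormat {
    Sha256,
    Sha512,
}

impl ObjectIdFormat {
    fn digest_len(self) -> usize {
        match self {
            ObjectIdFormat::Sha256 => 32,
            ObjectIdFormat::Sha512 => 64,
        }
    }
}

/// Hypothetical parsed header; the real format stores this uncompressed.
struct Header {
    content_type: u64,                // magic for the wrapped content type
    object_id_format: ObjectIdFormat, // tag for the digest format
    total_size: u64,                  // total size of the stream
    refs: Vec<Vec<u8>>,               // every referenced fs-verity object
    mappings: Vec<(Vec<u8>, usize)>,  // (key, index into `refs`)
}

impl Header {
    /// Reject a stream that wraps the wrong content type or whose tables
    /// are inconsistent with each other.
    fn validate(&self, expected_content_type: u64) -> Result<(), String> {
        if self.content_type != expected_content_type {
            return Err(format!(
                "content type {:#x} does not match expected {:#x}",
                self.content_type, expected_content_type
            ));
        }
        let len = self.object_id_format.digest_len();
        if self.refs.iter().any(|r| r.len() != len) {
            return Err("reference with wrong digest length".into());
        }
        if self.mappings.iter().any(|(_, idx)| *idx >= self.refs.len()) {
            return Err("mapping points outside the reference table".into());
        }
        Ok(())
    }

    /// GC enumeration only needs the header, not the stream body.
    fn referenced_objects(&self) -> impl Iterator<Item = &[u8]> {
        self.refs.iter().map(Vec::as_slice)
    }
}
```

In this sketch the mapping table holds indexes rather than digests, which is why it can stay smaller than the reference table.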
I think unless we prove out that composefs can be a very good way to store OCI, then it is not worth investing in. Thankfully that's not the case - I think it is (and I believe you do too!). So it's not that it has "nothing to do with OCI" (right?) - how about "has the capability to easily/natively store any type of content that one would want to represent as read-only immutable versioned filesystem trees". For example, today Android as far as I know uses fsverity on single zip files, and they've made it work quite well, but it's harder to get deduplication across apps that way, and maybe someday they go to a composefs-like model.
I rebased this, let's see if CI passes now.
Just to be clear, when I said "it has nothing to do with OCI" I specifically meant composefs-ostree, not composefs-rs generally (which very clearly was designed with OCI in mind). Very obviously the main target of composefs-rs right now is bootc (OCI), probably followed by container storage (obviously also OCI). flatpak is probably a distant third at the moment, and indeed, even that has something to do with OCI (the current flatpak demo only works with OCI, in fact)...
Ya, that's sort of what I meant... it would be cool to show that you can really do a lot of different things with this stuff...
    if filetype.is_symlink() {
        Ok((zlib_header, Box::new(empty())))
    } else {
        let fd_path = format!("/proc/self/fd/{}", path_fd.as_fd().as_raw_fd());
Tangential to this but I'd like to use https://docs.rs/crate/rustix-linux-procfs/latest I think
I rebased this and fixed some comments. Still some work to do though.
Signed-off-by: Alexander Larsson <alexl@redhat.com>
This lets you look up a ref digest from the splitstream by index and is needed by the ostree code.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
This is basically ensure_object_from_fd(), but for anything implementing Read. ensure_object_from_fd() is now basically reimplemented on top of this. We will need this in the ostree support code for streaming a zlib compressed file to the repo.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
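The shape of that refactoring can be sketched as follows: one generic function consumes any `Read` in chunks, and the file-based entry point becomes a thin wrapper around it. `store_from_reader` and `store_from_file` are hypothetical names, not the repository API, and `DefaultHasher` is only a std-only stand-in for the real fs-verity digest.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read};

/// Consume `reader` to the end in fixed-size chunks and return
/// (bytes read, stand-in digest). A real implementation would also
/// write the chunks into the repository as it goes.
fn store_from_reader(mut reader: impl Read) -> io::Result<(u64, u64)> {
    let mut hasher = DefaultHasher::new();
    let mut buf = [0u8; 64 * 1024];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        hasher.write(&buf[..n]);
        total += n as u64;
    }
    Ok((total, hasher.finish()))
}

/// The file-based variant simply delegates, since `File` implements `Read`.
fn store_from_file(file: File) -> io::Result<(u64, u64)> {
    store_from_reader(file)
}
```

Any decompressing reader that implements `Read` could be passed to the generic function in the same way, which is the streaming case the commit message mentions.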
Ok, I updated this to the latest version and added streaming creation of repo files and parallelized fetching. Plus some other cleanups.
Ok, I spent some time on this, it's now much more like the "cfsctl oci" commands and behavior, and it does parallel fetches. I also added various integration tests. I think this is pretty complete for what it does (i.e. imports ostree commits into composefs and lets you mount it). There are some TODOs for summary and delta support, but those are not necessarily super important for the basic functionality.
        ref ostree_ref,
        base_name,
    } => {
        eprintln!("Fetching {ostree_ref}");
Don't log via eprintln!; we have the progress API now.
Also on that topic...I think we should expose a varlink API for this now, right?
I guess neither of these need to strictly block merging though.
🤔 I guess actually...if we go down this varlink path, perhaps in theory we could have both the oci and ostree fetchers be extension binaries i.e. something like /usr/libexec/composefs/ext/oci is automatically cfsctl oci? That could be interesting...and would actually force us to have a good "core" varlink api.
    for i in 1..256 {
        // Bucket ends are (non-strictly) increasing
        if buckets[i] < buckets[i - 1] {
In general in Rust many array accesses can be done more elegantly and more safely than just direct indexing. In this specific case I think https://doc.rust-lang.org/stable/std/primitive.slice.html#method.array_windows is what we want
array_windows is unstable though, do we really want to use that?
I used regular .windows() instead. Also, I spent some time in general rustifying the code and cleaning it up.
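For reference, the stable `.windows()` form of the bucket check from the quoted diff could look roughly like this. The function name and the `u32` element type are assumptions; only the comparison itself is taken from the snippet under review.

```rust
/// Returns true when bucket end offsets are (non-strictly) increasing.
/// `windows(2)` yields every adjacent pair, so no manual indexing like
/// `buckets[i - 1]` is needed and an empty or one-element slice passes.
fn buckets_are_monotonic(buckets: &[u32]) -> bool {
    buckets.windows(2).all(|pair| pair[0] <= pair[1])
}
```

The remaining indexing inside the closure is always in bounds, because each window is guaranteed to have exactly two elements.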
    // until the queue is drained and all in-flight fetches have completed.
    let mut join_set: JoinSet<Result<FetchResult<ObjectID>>> = JoinSet::new();

    loop {
We can interleave metadata and data fetches, it's what libostree does. Is it worth the added complexity? Maybe not.
Probably not. This thing is actually surprisingly fast as is:
$ time target/debug/cfsctl --repo repo ostree pull https://dl.flathub.org/repo runtime/org.gnome.Platform/x86_64/50
Fetching runtime/org.gnome.Platform/x86_64/50
█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 16288/16288
commit f6fb972824514aefc06b23d7d591192c9cba2ad72648bf0473d06a565c40c264
verity a2e16123b2310d65b6b886e89e4fc45ec61c47efd0b4daac05bd3132ac3d8f78890fcbb22d0975411917742d17e86ad52709c8ec1ff33a62b97e94f38528242a
image 079ade7aba4e5fef51b30a80e3d1805fd5f79d9a48f240340c7a94f98497d8265edee857563308c00574a1bab792168caa0152af04ef47ac15d8a935ca291e15
tagged runtime/org.gnome.Platform/x86_64/50
objects 2752 metadata + 16288 files fetched
real 0m13,006s
user 0m12,066s
sys 0m1,663s
$ du -csh repo
1,1G repo
1,1G total
 *
 * Commit splitstreams are mappings from a set of ostree sha256
 * digests into the content for that ostree object. The content is
 * defined as some data, and an optional ObjectID referencing an
I read this thing twice and didn't fully understand it. "The content is defined as some data" is vague 😄
Our goal is conceptually to define a serialization of an ostree commit into a single "stream", right? And then splitting out content objects as externals.
Hmm...why wouldn't it work to basically do what we do with tar, walk the commit in a depth-first manner, serializing metadata + externals as we go?
Well, the "some data" is just generally the ostree object data for the digest.
We don't just want to have a serialization, because we also want to use the commit as a way to efficiently look up ostree objects in the commit. We use this during pull to avoid pulling objects that were in the previous version of the commit.
So, it's more like a hash table from sha256 digests to objects that optionally carry external object ids.
And it's not actually "inline data OR external refs"; we sometimes have to have both, because we need to store metadata as well. So, more like we store the archive-z2 file header in the data, and then the content in the external ref.
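A small sketch of that data model, assuming hypothetical names (`Entry`, `CommitMap`, `missing`) and using a std `HashMap` as a stand-in for the actual on-disk lookup table. Each entry always has inline data and may additionally point at an external object, and a base commit's map can be used to filter out objects that do not need to be pulled again.

```rust
use std::collections::HashMap;

/// ostree object checksum (sha256).
type Sha256 = [u8; 32];
/// Stand-in for the repository's ObjectID type.
type ObjectId = Vec<u8>;

/// One object in the commit: inline data is always present (for a file
/// object, e.g. the archive-z2 file header), and the file content may
/// live in an external object.
struct Entry {
    inline: Vec<u8>,
    external: Option<ObjectId>,
}

/// Hypothetical in-memory view of a commit splitstream.
struct CommitMap {
    objects: HashMap<Sha256, Entry>,
}

impl CommitMap {
    /// Of the digests a new commit needs, return those this (base)
    /// commit does not already provide, i.e. the ones still to fetch.
    fn missing<'a>(&self, wanted: &'a [Sha256]) -> Vec<&'a Sha256> {
        wanted
            .iter()
            .filter(|digest| !self.objects.contains_key(*digest))
            .collect()
    }
}
```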
> We don't just want to have a serialization, because we also want to use the commit as a way to efficiently look up ostree objects in the commit.
Would parsing all of the metadata be expensive though?
I mean, would it be impossible to change it to something else? For sure not, but what is the point? Allison and I spent a fair amount of time creating the new splitstream format specifically for this use. So it is the format we have, and it's intended to be efficient for what we use it for.
OK, fair! But can you spend some tokens clarifying the docs a bit at least?
I get the efficiency idea, but one thing that seems odd to me right now is that because we store this ostree-specific thing in the split stream content, it ends up zstd compressed. So we're at least reading the whole thing into RAM, we can't mmap etc.
With putting tar in split stream, this all made sense because we basically don't look at the tar stream unless we're copying the image out.
Also, while I get that it was nontrivial to design the format, there's also the traditional "cost of maintenance > cost of writing" to consider. Splitstream is a good bit of complexity on its own, but I think it's turned out mostly OK because for the OCI case it basically is a wrapper for a very well known thing - tar (ok well tar is a mess too, but it's a well-known mess). This work here is combining split stream with two entirely different more bespoke formats (splitstream-ostree and ostree).
I guess one way I'd say this is if you have a data format, it should have the ability to be converted to JSON, have a "structure checker" like fsck etc.
I'm aware I ~lost this argument before but e.g. https://cbor.io is pretty widely used. Does pulling in cbor for just this secondary bespoke binary format have the right cost/benefit? Perhaps not. (But, since we already need it: why not gvariant?)
Dunno. This is a discussion, nothing I am saying here is blocking.
I'll update the docs to be more readable and comprehensive, and to document the final/current state of things. And, I agree that having them zstd compressed does make it a bit weird for this to claim "efficiency", although I sort of agree with Allison's more modern view of mmap and its problems.
That said, I fundamentally think a "bucket of sha256 indexed objects" is the more correct format for an ostree thing. Serializing an ostree commit tree just feels wrong. Like, would you then duplicate things that were shared in many places (like dirmetas, or hardlinks)?
> I'll update the docs to be more readable, comprehensive and documenting the final/current state of things.
Thanks.
> That said, I fundamentally think a "bucket of sha256 indexed objects" is the more correct format for an ostree thing. Serializing an ostree commit tree just feels wrong. Like, would you then duplicate things that were shared in many places (like dirmetas, or hardlinks)?
No, I think the obvious flattened serialization would just have "each object is emitted once" semantics. Hardlinks are implicit in the ostree format - the data doesn't have st_nlink.
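As an illustration of the "each object is emitted once" semantics, here is a toy depth-first walk over an object graph that records every object the first time it is reached and skips it afterwards. The `u32` ids and the `HashMap` graph are stand-ins for ostree checksums and tree objects; this is not code from the PR.

```rust
use std::collections::{HashMap, HashSet};

/// Walk the graph depth-first from `root` and return object ids in
/// emission order. An object reachable through several parents (like a
/// shared dirmeta) appears only once in the output.
fn emit_once(graph: &HashMap<u32, Vec<u32>>, root: u32) -> Vec<u32> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    let mut stack = vec![root];
    while let Some(id) = stack.pop() {
        // `insert` returns false if the id was already emitted.
        if !seen.insert(id) {
            continue;
        }
        out.push(id);
        if let Some(children) = graph.get(&id) {
            // Push in reverse so children are visited in listed order.
            stack.extend(children.iter().rev().copied());
        }
    }
    out
}
```

With `1 -> [2, 3]`, `2 -> [4]` and `3 -> [4]`, the shared object `4` is emitted once, under the first parent that reaches it.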
I added doc/ostree.md, which has more detailed documentation on the format, including some general faffing about ostree and how this is supposed to be used.
@allisonkarlitskaya You have a "changes requested" here which blocks merges
Based on ideas from composefs#141

This is an initial version of ostree support. This allows pulling from local and remote ostree repos, which will create a set of regular file content objects, as well as a commit splitstream containing all the remaining ostree objects and file data. From the splitstream we can create an image.

When pulling a commit, base commits (i.e. "the previous version") can be specified, either manually and/or added automatically based on the parent commit or the previous commit for the pulled ref. Any objects in that base commit will not be downloaded.

Commits are splitstreams named ostree-commit-xxxx, and refs that point to these are refs/ostree/$ref. erofs images are automatically created for pulled commits, and they can be mounted with "cfsctl ostree mount".

There are also some other subcommands that are similar to those of oci:
* dump
* compute-id
* inspect
* tag
* untag
* images

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Assisted-by: Claude Code (Opus 4.6)
Based on ideas from #141
This is an initial version of ostree support. This allows pulling
from local and remote ostree repos, which will create a set of
regular file content objects, as well as a blob containing all the
remaining ostree objects. From the blob we can create an image.
When pulling a commit, a base blob (i.e. "the previous version") can be
specified. Any objects in that base blob will not be downloaded. If a
name is given for the pulled commit, then pre-existing blobs with the
same name will automatically be used as a base blob.
This is an initial version and there are several things missing: