* [tvix] string contexts vs. reference scanning
[not found] ` <20221202152213.3a59e629@ostraka>
@ 2023-01-09 22:07 ` Vincent Ambo
2023-01-11 11:49 ` sterni
2023-03-16 9:41 ` Vincent Ambo
2023-01-10 20:20 ` reference-scanning inputDrvs/inputSrcs Adam Joseph
1 sibling, 2 replies; 7+ messages in thread
From: Vincent Ambo @ 2023-01-09 22:07 UTC (permalink / raw)
To: Adam Joseph, depot; +Cc: Lukas Epple, Griffin Smith, Florian Klink
[this mail thread is started out of another mail conversation]
We are currently implementing `builtins.derivationStrict`, which
translates the string context in evaluator values to the inputDrvs and
inputSrcs fields in a derivation struct in C++ Nix.
Florian and I have been "documenting"[0] our understanding of the code
that does this in Nix as we go along.
As amjoseph has previously said, string contexts are likely unnecessary:
Adam Joseph <adam@westernsemico•com> writes:
> Related to that second reason: I've written a tool (using the awesome
> libnixstore crate) that calculates drv.inputDrvs using scanForReferences
> instead of string contexts, and compares the calculated results to what
> cppnix produced. I ran it on the 144,586 derivations in my buildfarm's
> store, and (after fixing a lot of bugs in my scanner) there was only a
> single disagreement in the entire store, which IMHO is better described
> as a revealed bug in nixpkgs
Because of this, We plan to use reference scanning over derivation input
attributes instead of string contexts, but may need to address the
problem of implementing unsafeDiscardStringContext to support certain
use cases[1].
Adam, would you be up for pushing a WIP CL (or a commit straight to
//users/amjoseph!) with the scanner you've written (and esp. with your
bugfixes!) that we could then figure out how to integrate? It might help
us avoid some overhead of running into the same bugs again :)
Any other input people might have on string contexts is also welcome!
[0]: https://github.com/tvlfyi/nix/commit/wtf-are-contexts
[1]: I'm not actually sure about this. It's possible that all these
use-cases that exist right now (e.g. string context discarding in TVL's
:llama: step) actually go away with the Tvix model of starting builds
immediately, but strongly ordered. Thoughts?
^ permalink raw reply [flat|nested] 7+ messages in thread
* reference-scanning inputDrvs/inputSrcs
[not found] ` <20221202152213.3a59e629@ostraka>
2023-01-09 22:07 ` [tvix] string contexts vs. reference scanning Vincent Ambo
@ 2023-01-10 20:20 ` Adam Joseph
2023-01-10 20:48 ` Vincent Ambo
1 sibling, 1 reply; 7+ messages in thread
From: Adam Joseph @ 2023-01-10 20:20 UTC (permalink / raw)
To: depot
Quoting Adam Joseph (2022-12-02 15:22:13)
> I've written a tool (using the awesome libnixstore crate) that calculates
> drv.inputDrvs using scanForReferences instead of string contexts, and compares
> the calculated results to what cppnix produced.... In a week or two I'm
> going to clean it up and write a post about it on the nixos discourse
Just want to mention that I haven't forgotten about this, but it is on hold for
a while because of two things:
1. Much as I loathe C++, I should implement this in cppnix first rather than in
tvix because:
a. If I don't get it exactly right on the first try, tvix will get the blame
for any behavioral differences, and that is not cool. The implementation
is straightforward and obvious except for placeholders and
unsafeDiscardOutputDependency -- those are subtle, and I can't be sure
I'll get them right on the first try.
b. Cppnix is already in production use. It can calculates scanned-inputDrvs
*in addition to* string-context inputDrvs, and emit a warning if they
disagree. Having no warnings on a large production codebase (nixpkgs,
maybe even Hydra) for a few months is the sign that it ready for an RFC.
2. Nix will need scanned-inputDrvs anyways in order for Ericson2314's "EBB"
derivations-that-emit-derivations-without-using-nixlang to be practically
useful. So I am going to wait for nix#3959 to merge before I try to
implement scanned-inputDrvs for cppnix. I will be much easier to convince
people of its usefulness at that point.
https://github.com/NixOS/nix/pull/3959
https://github.com/NixOS/nix/pull/3959#issuecomment-1375508978
You may not hear much from me for the next month or two. I have been dragged
into a really enormous project that I have been wanting to do for a long time:
https://github.com/NixOS/nixpkgs/pull/209870
Getting such a major change approved only happens if it resolves some kind of
excruciating pain. Fortunately that excruciating pain is present right now:
https://github.com/NixOS/nixpkgs/issues/108305
So #209870 is my top priority right now because at the moment it is very easy to
convince people that it is useful.
- a
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: reference-scanning inputDrvs/inputSrcs
2023-01-10 20:20 ` reference-scanning inputDrvs/inputSrcs Adam Joseph
@ 2023-01-10 20:48 ` Vincent Ambo
0 siblings, 0 replies; 7+ messages in thread
From: Vincent Ambo @ 2023-01-10 20:48 UTC (permalink / raw)
To: Adam Joseph, depot
On 1/10/23 23:20, Adam Joseph wrote:
> If I don't get it exactly right on the first try, tvix will get the blame
> for any behavioral differences, and that is not cool.
I think we have thick enough skin to deal with this ;) My plan is to go
ahead anyways and implement this string-scanning based solution. We're
fairly close to producing arbitrary derivation hashes, and as we're
serialising derivations into the same format as C++ Nix we'll be able to
use existing diff tools to figure out discrepancies when computing
hashes over the whole package set.
I think this will allow us to figure out problems quite quickly! We can
always resort to implementing string contexts if necessary.
> The implementation is straightforward and obvious except for placeholder > and unsafeDiscardOutputDependency
placeholders should fit into how we're envisioning the rest of the
machinery pretty well, we'll see.
context discarding builtins are interesting, I have a working theory
that the use-cases in which they are currently used simply do not apply
in Tvix (and we can implement them as no-ops that emit warnings), and
will probably write up another post about that.
> I have been dragged into a really enormous project that I have been wanting
> to do for a long time:
>
> https://github.com/NixOS/nixpkgs/pull/209870
Cool to see work on bootstrap infrastructure. Hoping we can (in the not
so far future) also collaborate with the Guix folks on that ...
Good luck with the project!
//V
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [tvix] string contexts vs. reference scanning
2023-01-09 22:07 ` [tvix] string contexts vs. reference scanning Vincent Ambo
@ 2023-01-11 11:49 ` sterni
2023-01-11 12:20 ` Vincent Ambo
2023-03-16 9:41 ` Vincent Ambo
1 sibling, 1 reply; 7+ messages in thread
From: sterni @ 2023-01-11 11:49 UTC (permalink / raw)
To: Vincent Ambo, depot
On 1/9/23 23:07, Vincent Ambo wrote:
> Any other input people might have on string contexts is also welcome!
One thing I'd want to see answered is how to handle import from
derivation. In current C++ Nix this is handled in the following way:
import calls coerceToPath on the value it gets passed. coerceToPath
looks at the string context and realises any derivation found within it.
Finally the actual file is retrieved from the store / disk.
There are also similar occasions where things get realised while
evaluating (interactively?) except reading / importing, but I don't have
a very good handle on those yet.
> [1]: I'm not actually sure about this. It's possible that all these
> use-cases that exist right now (e.g. string context discarding in TVL's
> :llama: step) actually go away with the Tvix model of starting builds
> immediately, but strongly ordered. Thoughts?
Currently you can work with string context in the following ways:
1. You can discard some using builtins.unsafeDiscardOutputDependency.
This has been, as far as I can tell, been added to combat the
oddity of the string context of the `drvPath` attribute.
Apparently disnix ran into the problem that the `drvPath`
of a derivation would cause all of its outputs to be built
in 2009. Subsequently, `builtins.unsafeDiscardOutputDependency`
was [introduced].
My _unconfirmed_ theory is that this was a quick and easy
workaround that was implemented without considering the underlying
problem. In my view, there is no reason why `drvPath` should
incur a reference to all outputs of the derivation as well as
the derivation file itself (I think this is thanks to the reference
scanner the store runs after the fact which determines if the
derivation and/or any of its outputs are actually referenced).
I would be interested in any theories why `drvPath` behaved
and maybe even should behave that way (maybe useful for recursive
Nix?). In my experience `drvPath` either never enters a derivation or
is closely accompanied by `builtins.unsafeDiscardOutputDependency`.
Maybe when implementing string contexts there was a confusion
what "=<drv_path>" should mean originally, but this was never
fixed when disnix came around.
2. You can discard all using builtins.unsafeDiscardStringContext.
I think the uses of this builtin fall into two categories:
- To drop wrongfully retained string context. All string
operations retain string context, even though some actually
destroy any reference that was present in the string.
Classic examples would be
`builtins.substring 0 3 ">>> ${pkgs.hello}"` or
`builtins.baseNameOf "${pkgs.hello}/bin/hello".
- As an escape hatch from references to the derivations
in question. We use this in //nix/buildkite: We use
derivation paths, so we can skip re-evaluating targets,
but discard any references to those files. Since buildkite
doesn't know about nix-copy-closure(1), it'd be difficult
to copy the required derivation files to an executing machine
even if we had the correct references. Instead we impurely
access the store and re-evaluate the target if the derivation
file is missing.
With reference scanning, wrongfully retained string context should
basically disappear, but so would the escape hatch. I think we'd
need to invent a new mechanism entirely, maybe even as ugly as
an `allowedImpureReferences = [ … ];`.
A third use is described in the item for appendContext.
3. You can check if there's any using builtins.hasContext.
4. You can query it using builtins.getContext.
In C++ Nix 2.3 this is not particularly useful, since you can
only inspect the root of the dependency graph (i.e. outPath
will just give you drvPath as context). In Nix >= 2.6 you can,
however, use this in conjunction with builtins.readFile
to query the references of store paths. This is of course already
possible before, but requires writing a reference scanner and/or
derivation parser in Nix. We have a hacky [prototype] of this
in depot.
I think it should be possible to emulate this using reference
scanning as well, i.e. tvix-eval would need to run the reference
scanner on the given string for getContext. This would actually
be pretty nice for doing dependency analysis.
5. You can add it using builtins.appendContext.
The main use case for `builtins.appendContext` I can think of,
is to restore string context after it has been discarded via
`builtins.unsafeDiscardStringContext`. This is required due
to technical limitations in C++ Nix that affect some algorithms:
If you, for example want to use (parts of) input strings
as keys in an attribute set, you need to make sure they have no
string context. A function that has such a step in its algorithm
would then use `builtins.getContext` to store the context,
run the actual algorithm after `builtins.unsafeDiscardStringContext`
is applied and finally return after restoring the string context
using `builtins.appendContext`.
[introduced]:
https://github.com/NixOS/nix/commit/437077c39dd7abb44b2ab02cb9c6215d125bef04
[prototype]:
https://code.tvl.fyi/tree/nix/dependency-analyzer/default.nix?id=805219a2fad0edac10d046fc5ad5820edb4482ee#n10
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [tvix] string contexts vs. reference scanning
2023-01-11 11:49 ` sterni
@ 2023-01-11 12:20 ` Vincent Ambo
0 siblings, 0 replies; 7+ messages in thread
From: Vincent Ambo @ 2023-01-11 12:20 UTC (permalink / raw)
To: sterni, depot
sterni <sternenseemann@systemli•org> writes:
> One thing I'd want to see answered is how to handle import from
> derivation.
Files enter an evaluation through the `EvalIO` interface, which is
passed the path(s) to read and returns the contents or store path.
In the actual implementation of `EvalIO` that supports store
interactions, store path access is detected and blocked/suspended until
build completion.
At the time that this calculation occurs, we already called
builtins.derivation or equivalent on the thing yielding that path, so
the build is already running.
> There are also similar occasions where things get realised while
> evaluating (interactively?) except reading / importing, but I don't have
> a very good handle on those yet.
I'm not sure what those would be, but there's nothing the evaluator can
do to read from disk that wouldn't go through EvalIO.
> 1. You can discard some using builtins.unsafeDiscardOutputDependency.
This doesn't really discard the context, it changes the _type_ of the
context entry. See cl/7807.
I believe that this is unnecessary, all uses of it in nixpkgs are a hack.
> My _unconfirmed_ theory is that this was a quick and easy workaround
> that was implemented without considering the underlying problem. In my
> view, there is no reason why `drvPath` should incur a reference to all
> outputs of the derivation as well as the derivation file itself (I
> think this is thanks to the reference scanner the store runs after the
> fact which determines if the derivation and/or any of its outputs are
> actually referenced).
Well, the drv file brings everything inside of it into scope (that
includes its output paths, which are part of the serialised
representation). Otherwise referencing a drv file would make "dead
paths" available in the builder.
It's conceptionally reasonable that this closure is yielded, I think. It
seems odd to me that it involves store database queries, though. All the
information should be *in* the drv file.
In Tvix, if a drv path is referenced then that drv was also part of the
evaluation, so we *should* already know all of that information without
running any queries.
> - To drop wrongfully retained string context. All string
> operations retain string context, even though some actually
> destroy any reference that was present in the string.
This is basically a workaround for how contexts work in C++ Nix. It's
not a problem if contexts are replaced by scanning.
> - As an escape hatch from references to the derivations in question.
> We use this in //nix/buildkite:
Yes, but we use it to work around a problem that we only have because of
the model of C++ Nix. In Tvix this whole thing would work differently
(the evalution itself drives the creation of the build targets).
We'd essentially have a buildkite-driver that uses Tvix to evaluate and
yields the build targets as necessary, using a store (local or remote)
as the synchronisation point.
The other uses are covered in https://cl.tvl.fyi/7807 and I think fairly
uncontroversial.
My gut feeling is that we can get away with not implementing any of the
destructive string context operations, and if we _really_ end up needing
unsafeDiscardStringContext, it's not terribly difficult to implement
(could be a `Value` variant containing another Value that just acts as a
marker).
//V
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [tvix] string contexts vs. reference scanning
2023-01-09 22:07 ` [tvix] string contexts vs. reference scanning Vincent Ambo
2023-01-11 11:49 ` sterni
@ 2023-03-16 9:41 ` Vincent Ambo
2023-03-16 12:00 ` Florian Klink
1 sibling, 1 reply; 7+ messages in thread
From: Vincent Ambo @ 2023-03-16 9:41 UTC (permalink / raw)
To: Adam Joseph, depot; +Cc: Florian Klink
Okay, there's some (for me) new information on string contexts that will
require some careful handling if we want to proceed with reference scanning.
Let me preface by saying that despite this problem, reference-scanning
for inputs yields perfectly functional, *equivalent* but not *identical*
derivations to Nix. We do currently consider it a problem because we
want to be fully hash-equal with C++ Nix, both to prove that our
implementation is correct and to make use of Hydra's cache.
Now, for the problem:
There are some real-life cases, for example during nixpkgs
bootstrapping, where multiple different fixed-output derivations are
written to produce the same hash.
For example, bootstrap sources that are downloaded early are fetched
using a special "builder hack", in which the `builder` field of the
derivation is populated with the magic string `builtins:fetchurl` and
the builder itself will perform a fetch, with everything looking like a
normal derivation to the user.
These bootstrap sources are later on defined *again*, once `curl` is
available, to be downloaded using the standard pkgs.fetchtarball
mechanism, but yielding the *same* outputs (as the same files are being
fetched).
In our reference scanning implementation, this output scanning of FOD
will yield whatever the *first* derivation was that produced the given
path as the drv to be stored in the `inputDrvs` field of the derivation.
There's an orthogonal problem which made this confusing to understand,
where C++ Nix has some special logic for how it hashes derivations that
use fixed-output paths, which we haven't fully replicated yet. This led
to hash differences which masked this underlying problem (the
differences still exist, and are a separate issue).
I discussed this with Adam yesterday and he suggested an approach
similar to `builtins.placeholder`, which would only be in effect for
fixed-output derivations. I think this is feasible but haven't sketched
anything yet.
Either way, for me this also raises the thought of whether we should
decouple Tvix's internal representation of a derivation from that of C++
Nix and only "materialise" C++ Nix derivations (and accompanying hashes)
where needed. Something to think about ...
//V
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [tvix] string contexts vs. reference scanning
2023-03-16 9:41 ` Vincent Ambo
@ 2023-03-16 12:00 ` Florian Klink
0 siblings, 0 replies; 7+ messages in thread
From: Florian Klink @ 2023-03-16 12:00 UTC (permalink / raw)
To: Vincent Ambo; +Cc: Adam Joseph, depot
[…]
>Let me preface by saying that despite this problem, reference-scanning
>for inputs yields perfectly functional, *equivalent* but not
>*identical* derivations to Nix. We do currently consider it a problem
>because we want to be fully hash-equal with C++ Nix, both to prove
>that our implementation is correct and to make use of Hydra's cache.
[…]
>There's an orthogonal problem which made this confusing to understand,
>where C++ Nix has some special logic for how it hashes derivations
>that use fixed-output paths, which we haven't fully replicated yet.
>This led to hash differences which masked this underlying problem (the
>differences still exist, and are a separate issue).
>
>I discussed this with Adam yesterday and he suggested an approach
>similar to `builtins.placeholder`, which would only be in effect for
>fixed-output derivations. I think this is feasible but haven't
>sketched anything yet.
I'm not sure this will be a problem in practice, at least when it comes
to "being able to substitute from binary caches".
For both fixed-output derivations with recursive and flat hashes, the
ATerm of the derivation we finally ended up picking is not ending up in
the info that's used to calculate the output hash.
In the case of a recursive sha256 FOD, a
`source:sha256:$narDigest:$storeDir:$outputName` is used as a
fingerprint:
https://cs.tvl.fyi/depot@def2d32319e5d70b21a799d463d07d880f5ff207/-/blob/tvix/nix-compat/src/derivation/mod.rs?L256#tab=references
In all other cases, a `derivation_or_fod_hash` is used, and baked into
a `output:$outputName:$derivation_or_fod_hash:$storeDir:$outputName`
fingerprint:
https://cs.tvl.fyi/depot@def2d32319e5d70b21a799d463d07d880f5ff207/-/blob/tvix/nix-compat/src/derivation/mod.rs?L269#tab=def
However, for /all/ FODs, `derivation_or_fod_hash` is calculated by the
`fod_digest()` function:
https://cs.tvl.fyi/depot@def2d32319e5d70b21a799d463d07d880f5ff207/-/blob/tvix/nix-compat/src/derivation/mod.rs?L172#tab=def
… which internally uses a
`fixed:out:$hashWithModeAsNixHashString:$outputPath` as inputs into the
sha256 function.
This means, even if we end up mapping the "wrong" fixed-output
derivation somewhere, the output hashes will still be the same.
And as lookups from the binary cache ask for the "compressed hash" /
`StorePath.digest` [^1] only, and the path of the .drv doesn't matter,
it shouldn't be a problem when it comes to substitute from the binary
cache.
So I'm not sure if all this placeholder logic is really necessary to
implement at all.
VWe can compare evaluation to be equivalent by simply calculating the
resulting output paths, both for Tvix and Nix, and use this to validate
the evaluator does the same.
>Either way, for me this also raises the thought of whether we should
>decouple Tvix's internal representation of a derivation from that of
>C++ Nix and only "materialise" C++ Nix derivations (and accompanying
>hashes) where needed. Something to think about ...
I personally don't think we want to, or should materialize .drv files in
/nix/store at all. The usecases for this that I've heard about so far
are mostly attempts to schedule /around the Nix scheduler/, or somehow
extract "build closures".
We didn't recently discuss much on how we want a Tvix scheduler/builder
interface to look like, but I personally would like it to treat some of
the Nix-specific details in a very opaque fashion.
It doesn't need to know how, e.g., the Nix sandbox env vars are
populated, or how some desired output paths have been calculated. It can
treat these as opaque strings to pass along to the build environment, or
to set up mountpoints as such, but doesn't understand how they are
produced.
These values can be calculated/produced by the thing that still has all
the context (pun intended) about Derivation construction.
That way, the builder would be more generic, and we could also play with
different path calculation modes without having to touch its
implementation.
--
flokli
[^1]: https://cs.tvl.fyi/depot@def2d32319e5d70b21a799d463d07d880f5ff207/-/blob/tvix/nix-compat/src/store_path.rs?L40
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2023-03-16 12:00 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CANHrikrEDPkH1raGDGAGETeATrWOJ=sBQCUXr6=pHJm1ajbd0A@mail.gmail.com>
[not found] ` <20221202152213.3a59e629@ostraka>
2023-01-09 22:07 ` [tvix] string contexts vs. reference scanning Vincent Ambo
2023-01-11 11:49 ` sterni
2023-01-11 12:20 ` Vincent Ambo
2023-03-16 9:41 ` Vincent Ambo
2023-03-16 12:00 ` Florian Klink
2023-01-10 20:20 ` reference-scanning inputDrvs/inputSrcs Adam Joseph
2023-01-10 20:48 ` Vincent Ambo
Code repositories for project(s) associated with this public inbox
https://code.tvl.fyi
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).