Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce subassembly offset output artifact #15710

Draft
wants to merge 4 commits into
base: develop
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions libevmasm/Assembly.cpp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be implemented for EOF as well (i.e. in assembleEOF(), or, if possible, just in assemble() covering both with the same code).

Since we're at the stage where EOF is passing semantic tests (and they will be enabled by default quite soon), we should start requiring all new features for work on EOF as well.

Original file line number Diff line number Diff line change
Expand Up @@ -1203,6 +1203,7 @@ LinkerObject const& Assembly::assembleLegacy() const

size_t subTagSize = 1;
std::map<u256, LinkerObject::ImmutableRefs> immutableReferencesBySub;
std::vector<LinkerObject> subAssemblies;
for (auto const& sub: m_subs)
{
auto const& linkerObject = sub->assemble();
Expand All @@ -1215,6 +1216,7 @@ LinkerObject const& Assembly::assembleLegacy() const
);
immutableReferencesBySub = linkerObject.immutableReferences;
}
subAssemblies.push_back(linkerObject);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linker object is already available as m_assembledObject on each Assembly in m_subs. No need to create another copy.

for (size_t tagPos: sub->m_tagPositionsInBytecode)
if (tagPos != std::numeric_limits<size_t>::max() && numberEncodingSize(tagPos) > subTagSize)
subTagSize = numberEncodingSize(tagPos);
Expand Down Expand Up @@ -1465,6 +1467,16 @@ LinkerObject const& Assembly::assembleLegacy() const
bytesRef r(ret.bytecode.data() + pos, bytesPerDataRef);
toBigEndian(ret.bytecode.size(), r);
}

std::vector<LinkerObject::SubAssembly> nestedSubAssemblies{};
for (auto const& subAssembly: subAssemblies)
nestedSubAssemblies.insert(nestedSubAssemblies.end(), subAssembly.subAssemblyData.begin(), subAssembly.subAssemblyData.end());

size_t const currentBytecodeSize = ret.bytecode.size();
updateSubAssemblyStartOffsets(nestedSubAssemblies, currentBytecodeSize);

ret.subAssemblyData.push_back({0, ret.bytecode.size(), isCreation(), ret.toHex(), nestedSubAssemblies});
Comment on lines +1471 to +1478
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculating subassembly locations like this seems inherently flawed to me. You're not accounting for data or any future changes of the code above that might result in something new being inserted between assemblies. You're also ignoring deduplication that may result in overlapping assemblies (that needs a test case as well).

I think this calculation should be instead done above, in the loop that goes over subRefs. There you know for sure at which point in the bytecode the subassembly gets inserted (it's simply ret.bytecode.size()).


return ret;
}

Expand Down Expand Up @@ -1776,3 +1788,12 @@ Assembly::OptimiserSettings Assembly::OptimiserSettings::translateSettings(front
asmSettings.evmVersion = _evmVersion;
return asmSettings;
}

void Assembly::updateSubAssemblyStartOffsets(std::vector<LinkerObject::SubAssembly>& _subAssemblies, size_t const _currentBytecodeSize) const
{
for (auto& subAssembly: _subAssemblies)
{
subAssembly.start = _currentBytecodeSize - subAssembly.length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm either missing something or this assumes that all subassemblies overlap and extend to the end of their parent assembly. Does it even work with more than one subassembly? If you have two subassemblies of the same length then you will end up with the same start location for both. And even if they're of different lengths it will be wrong.

EDIT: Yeah, you even have a test showing this overlap (standard_subassembly_offsets/):

                                        {
                                            "isCreation": false,
                                            "length": 780,
                                            "start": 1007
                                        },
                                        {
                                            "isCreation": true,
                                            "length": 130,
                                            "start": 1657,
                                            "subs": [

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should also have some asserts here. At the very least that a subassembly does not stick outside of its parent assembly.

updateSubAssemblyStartOffsets(subAssembly.subs, _currentBytecodeSize);
}
}
2 changes: 2 additions & 0 deletions libevmasm/Assembly.h
Original file line number Diff line number Diff line change
Expand Up @@ -287,6 +287,8 @@ class Assembly
[[nodiscard]] bytes assemblePushDeployTimeAddress() const;
[[nodiscard]] bytes assembleTag(AssemblyItem const& _item, size_t _pos, bool _addJumpDest) const;

void updateSubAssemblyStartOffsets(std::vector<LinkerObject::SubAssembly>& _subAssemblies, size_t const _currentBytecodeSize) const;

protected:
/// 0 is reserved for exception
unsigned m_usedTags = 1;
Expand Down
10 changes: 10 additions & 0 deletions libevmasm/LinkerObject.h
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,16 @@ struct LinkerObject
/// Bytecode offsets of named tags like function entry points.
std::map<std::string, FunctionDebugData> functionDebugData;

struct SubAssembly {
size_t start;
size_t length;
bool isCreation;
std::string bytecode;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the same thing as bytecode stored directly in LinkerObject. The name should contain enough information to make that clear:

Suggested change
std::string bytecode;
std::string linkedBytecodeHex;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, why do you even store it? It's relatively large (on the same order as the size of the whole LinkerObject) and all you ever do with it is check its size (which you could get from length instead).

std::vector<SubAssembly> subs {};
};

std::vector<SubAssembly> subAssemblyData;

/// Appends the bytecode of @a _other and incorporates its link references.
void append(LinkerObject const& _other);

Expand Down
38 changes: 37 additions & 1 deletion libsolidity/interface/StandardCompiler.cpp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the feature to the CLI too. We should keep them at parity (and it's a pain for testing/development if the only way to access a feature is through StandardJSON).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, Yul compilation is not covered. Aside from general inconsistency, this makes two-step compilation less powerful, which may be a problem for future parallelization.

Original file line number Diff line number Diff line change
Expand Up @@ -237,7 +237,7 @@ bool isArtifactRequested(Json const& _outputSelection, std::string const& _file,
std::vector<std::string> evmObjectComponents(std::string const& _objectKind)
{
solAssert(_objectKind == "bytecode" || _objectKind == "deployedBytecode", "");
std::vector<std::string> components{"", ".object", ".opcodes", ".sourceMap", ".functionDebugData", ".generatedSources", ".linkReferences"};
std::vector<std::string> components{"", ".object", ".opcodes", ".sourceMap", ".functionDebugData", ".generatedSources", ".linkReferences", ".subAssemblyOffsets"};
if (_objectKind == "deployedBytecode")
components.push_back(".immutableReferences");
return util::applyMap(components, [&](auto const& _s) { return "evm." + _objectKind + _s; });
Expand Down Expand Up @@ -374,6 +374,24 @@ Json formatImmutableReferences(std::map<u256, evmasm::LinkerObject::ImmutableRef
return ret;
}

Json formatSubAssemblyOffsets(std::vector<evmasm::LinkerObject::SubAssembly> const& _subAssemblyOffsets)
{
Json subs = Json::array();

for (auto const& subAssembly: _subAssemblyOffsets)
{
Json subAssemblyInfo;
subAssemblyInfo["start"] = Json::number_unsigned_t(subAssembly.start);
subAssemblyInfo["length"] = Json::number_unsigned_t(subAssembly.length);
subAssemblyInfo["isCreation"] = Json::boolean_t(subAssembly.isCreation);
if (!subAssembly.subs.empty())
subAssemblyInfo["subs"] = formatSubAssemblyOffsets(subAssembly.subs);
subs.emplace_back(subAssemblyInfo);
}

return subs;
}

std::optional<Json> checkKeys(Json const& _input, std::set<std::string> const& _keys, std::string const& _name)
{
if (!_input.empty() && !_input.is_object())
Expand Down Expand Up @@ -1235,6 +1253,12 @@ Json StandardCompiler::importEVMAssembly(StandardCompiler::InputsAndSettings _in
creationJSON["functionDebugData"] = formatFunctionDebugData(stack.object(sourceName).functionDebugData);
if (evmCreationArtifactRequested("linkReferences"))
creationJSON["linkReferences"] = formatLinkReferences(stack.object(sourceName).linkReferences);
if (evmCreationArtifactRequested("subAssemblyOffsets"))
{
Json ret = Json::object();
ret["subs"] = formatSubAssemblyOffsets(stack.runtimeObject(sourceName).subAssemblyData);
creationJSON["subAssemblyOffsets"] = std::move(ret);
}
evmData["bytecode"] = creationJSON;
}

Expand Down Expand Up @@ -1503,6 +1527,12 @@ Json StandardCompiler::compileSolidity(StandardCompiler::InputsAndSettings _inpu
creationJSON["linkReferences"] = formatLinkReferences(compilerStack.object(contractName).linkReferences);
if (evmCreationArtifactRequested("generatedSources"))
creationJSON["generatedSources"] = compilerStack.generatedSources(contractName, /* _runtime */ false);
if (evmCreationArtifactRequested("subAssemblyOffsets"))
{
Json ret = Json::object();
ret["subs"] = formatSubAssemblyOffsets(compilerStack.object(contractName).subAssemblyData);
creationJSON["subAssemblyOffsets"] = std::move(ret);
}
Comment on lines +1531 to +1535
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, with nlohmann-json you can create JSON objects in place:

Suggested change
{
Json ret = Json::object();
ret["subs"] = formatSubAssemblyOffsets(compilerStack.object(contractName).subAssemblyData);
creationJSON["subAssemblyOffsets"] = std::move(ret);
}
creationJSON["subAssemblyOffsets"] = {
{"subs", formatSubAssemblyOffsets(compilerStack.object(contractName).subAssemblyData)}
};

We should generally use that style whenever possible, because the intermediate variables does not add anything of value, especially one with a horrible name like ret. And the reversed order always makes things harder to follow.

evmData["bytecode"] = creationJSON;
}

Expand Down Expand Up @@ -1533,6 +1563,12 @@ Json StandardCompiler::compileSolidity(StandardCompiler::InputsAndSettings _inpu
deployedJSON["immutableReferences"] = formatImmutableReferences(compilerStack.runtimeObject(contractName).immutableReferences);
if (evmDeployedArtifactRequested("generatedSources"))
deployedJSON["generatedSources"] = compilerStack.generatedSources(contractName, /* _runtime */ true);
if (evmDeployedArtifactRequested("subAssemblyOffsets"))
{
Json ret = Json::object();
ret["subs"] = formatSubAssemblyOffsets(compilerStack.object(contractName).subAssemblyData);
deployedJSON["subAssemblyOffsets"] = std::move(ret);
}
evmData["deployedBytecode"] = deployedJSON;
}

Expand Down
5 changes: 4 additions & 1 deletion test/cmdlineTests/standard_import_asm_json/output.json
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,10 @@
"linkReferences": {},
"object": "<BYTECODE REMOVED>",
"opcodes":"<OPCODES REMOVED>",
"sourceMap":"<SOURCEMAP REMOVED>"
"sourceMap":"<SOURCEMAP REMOVED>",
"subAssemblyOffsets": {
"subs": []
}
},
"deployedBytecode": {
"functionDebugData": {},
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,16 @@
"linkReferences": {},
"object": "<BYTECODE REMOVED>",
"opcodes":"<OPCODES REMOVED>",
"sourceMap":"<SOURCEMAP REMOVED>"
"sourceMap":"<SOURCEMAP REMOVED>",
"subAssemblyOffsets": {
"subs": [
{
"isCreation": false,
Copy link
Member

@cameel cameel Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I wonder why this isn't true. Does exporting and reimporting asm JSON lose the creation status or is it just because of how the artificial input was crafted? If the status gets lost, this would be a bug (and should be reported).

"length": 87,
"start": 0
}
]
}
Comment on lines +12 to +20
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The input file has two assemblies but the output shows only one. Is this a bug in your feature or an instance of #15725 (because the nested assembly is unreferenced)?

},
"deployedBytecode": {
"functionDebugData": {},
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,16 @@
"linkReferences": {},
"object": "<BYTECODE REMOVED>",
"opcodes":"<OPCODES REMOVED>",
"sourceMap": ""
"sourceMap": "",
"subAssemblyOffsets": {
"subs": [
{
"isCreation": false,
"length": 75,
"start": 0
}
]
}
},
"deployedBytecode": {
"functionDebugData": {},
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,23 @@
"linkReferences": {},
"object": "<BYTECODE REMOVED>",
"opcodes":"<OPCODES REMOVED>",
"sourceMap":"<SOURCEMAP REMOVED>"
"sourceMap":"<SOURCEMAP REMOVED>",
"subAssemblyOffsets": {
"subs": [
{
"isCreation": true,
"length": 130,
"start": 0,
"subs": [
{
"isCreation": false,
"length": 104,
"start": 26
}
]
}
]
}
Comment on lines +13 to +28
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, is Sourcify ok with getting only subassembly locations? I've always been thinking about metadata as a separate subassembly myself, because it's separate data Object in Yul, but looking at the PR I remembered that it's actually not the case at evmasm level. Metadata goes into Assembly::m_auxiliaryData and logically becomes a part of each assembly's bytecode, not a separate object. It's simply appended after all the subs and other data. I think that due to this it will still be necessary to use heuristics to fish out its location within the assembly, even knowing location of all subassemblies.

I think we may need to include the start and length of Assembly::m_auxiliaryData separately for each sub (when it's non-empty). For completeness we may want to simply list all the data chunks (that would actually have been nice to have for compiler debugging in some cases).

CC @kuzdogan.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, we should have some tests that include data objects between assemblies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if we had the exact location of metadata, we could create a more robust, Boost-based test that would get the structure info and the bytecode and diff the CBOR bit. It's not easy to spot a problem just looking at the values in command-line tests and we're not even shown the bytecode.

}
}
}
Expand Down
18 changes: 17 additions & 1 deletion test/cmdlineTests/standard_no_append_cbor/output.json
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,23 @@
"linkReferences": {},
"object": "<BYTECODE REMOVED>",
"opcodes":"<OPCODES REMOVED>",
"sourceMap":"<SOURCEMAP REMOVED>"
"sourceMap":"<SOURCEMAP REMOVED>",
"subAssemblyOffsets": {
"subs": [
{
"isCreation": true,
"length": 27,
"start": 0,
"subs": [
Comment on lines +13 to +19
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The top-level assembly is not nested in anything. Seems weird to put its data under 'subs'. It should be at the top level. Similarly with subAssemblyOffsets - it should rather be assemblyOffsets since it covers the whole binary, including the top.

Finally, we might end up including more info there later so maybe assemblyStructure would be a more future-proof name?

Suggested change
"subAssemblyOffsets": {
"subs": [
{
"isCreation": true,
"length": 27,
"start": 0,
"subs": [
"assemblyStructure": {
"isCreation": true,
"length": 27,
"start": 0,
"subassemblies": [

I'd use that name for LinkerObject::SubAssembly too, i.e. LinkerObject::Structure, which would nicely mirror Object::Structure which is similar in the sense that it stores metadata about the hierarchy (but on Yul level).

{
"isCreation": false,
"length": 3,
"start": 24
}
]
}
]
}
}
}
}
Expand Down
19 changes: 19 additions & 0 deletions test/cmdlineTests/standard_subassembly_offsets/input.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, we also need some coverage for optimized compilation.

Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
{
"language": "Solidity",
"sources": {
"contract.sol": {
"content": "// SPDX-License-Identifier: GPL-3.0\npragma solidity >=0.7.0 <0.9.0;\ncontract A {}\ncontract Storage { uint256 public number; bytes public code = type(A).creationCode; function store(uint256 num) public { number = num; }}"
}
},
"settings": {
"evmVersion": "cancun",
"outputSelection": {
"*": {
"Storage": [
"evm.bytecode.subAssemblyOffsets"
]
}
}
Comment on lines +10 to +16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're not testing evm.deployedBytecode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, bad indentation. And I'd put the contract in a separate file.

}
}

45 changes: 45 additions & 0 deletions test/cmdlineTests/standard_subassembly_offsets/output.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
{
"contracts": {
"contract.sol": {
"Storage": {
"evm": {
"bytecode": {
"subAssemblyOffsets": {
"subs": [
{
"isCreation": true,
"length": 1787,
"start": 0,
"subs": [
{
"isCreation": false,
"length": 780,
"start": 1007
},
{
"isCreation": true,
"length": 130,
"start": 1657,
"subs": [
{
"isCreation": false,
"length": 104,
"start": 1683
}
]
}
]
}
]
}
}
}
}
}
},
"sources": {
"contract.sol": {
"id": 0
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,10 @@
"linkReferences": {},
"object": "",
"opcodes": "",
"sourceMap": ""
"sourceMap": "",
"subAssemblyOffsets": {
"subs": []
}
},
"deployedBytecode": {
"functionDebugData": {},
Expand All @@ -25,7 +28,10 @@
"linkReferences": {},
"object": "",
"opcodes": "",
"sourceMap": ""
"sourceMap": "",
"subAssemblyOffsets": {
"subs": []
}
},
"gasEstimates": null,
"legacyAssembly": null,
Expand Down Expand Up @@ -65,7 +71,10 @@
"linkReferences": {},
"object": "",
"opcodes": "",
"sourceMap": ""
"sourceMap": "",
"subAssemblyOffsets": {
"subs": []
}
},
"deployedBytecode": {
"functionDebugData": {},
Expand All @@ -74,7 +83,10 @@
"linkReferences": {},
"object": "",
"opcodes": "",
"sourceMap": ""
"sourceMap": "",
"subAssemblyOffsets": {
"subs": []
}
},
"gasEstimates": null,
"legacyAssembly": null,
Expand Down