
refactor: job isolation done #204

Open
wants to merge 13 commits into main from refactor/job-isolation
Conversation

@Mohiiit (Contributor) commented Jan 22, 2025

Files and information needed by each job should be provided by the worker of the respective job; no inter-job dependency.

@@ -2,3 +2,10 @@ pub const BLOB_DATA_FILE_NAME: &str = "blob_data.txt";
pub const SNOS_OUTPUT_FILE_NAME: &str = "snos_output.json";
pub const PROGRAM_OUTPUT_FILE_NAME: &str = "program_output.txt";
pub const CAIRO_PIE_FILE_NAME: &str = "cairo_pie.zip";

pub const JOB_METADATA_CAIRO_PIE_PATH: &str = "cairo_pie_path";

We have a lot of metadata constants here. Should we reuse this one?


+1

@@ -133,7 +133,11 @@ impl Job for DaJob {
// data transformation on the data
tracing::trace!(job_id = ?job.id, "Applied FFT transformation");

store_blob_data(transformed_data.clone(), block_no, config.clone()).await?;
let blob_data_path = format!("{}/{}", block_no, BLOB_DATA_FILE_NAME);

Let's change this so the worker specifies the full path, as discussed.

@@ -196,17 +218,17 @@ impl SnosJob {
SnosError::SnosOutputUnstorable { internal_id: internal_id.to_string(), message: e.to_string() }
})?;

let program_output_key = format!("{block_number}/{PROGRAM_OUTPUT_FILE_NAME}");

will come from metadata now?


let program_output = storage_client.get_data(&key).await.map_err(|e| JobError::Other(OtherError(e)))?;
// Get the array of program output paths from metadata
let program_paths: Vec<String> = serde_json::from_str(

  1. Pass the path directly.
  2. As discussed, should we use config structs for the metadata fields?

ORCHESTRATOR_METRICS.db_calls_response_time.record(duration.as_secs_f64(), &attributes);
Ok(Some(job))
// Add debug logging to see the raw document
tracing::info!(raw_document = ?doc, "Raw document from MongoDB");

Suggested change
tracing::info!(raw_document = ?doc, "Raw document from MongoDB");
tracing::debug!(raw_document = ?doc, "Raw document from MongoDB");

tracing::info!(raw_document = ?doc, "Raw document from MongoDB");

// Try to deserialize and log any errors
match mongodb::bson::from_document::<JobItem>(doc.clone()) {

We're cloning here so that we can log it in the error later? If so, should we add a comment explaining that's why we're cloning it?


These are also types, to be precise. Maybe we should add them to types.rs, or create a types folder with jobs.rs and metadata.rs?


Yes, something like:

src/
  types/
    mod.rs          // Re-exports all types
    job.rs          // Job-related types
    common.rs       // Common types

pub struct ProvingMetadata {
// Required fields
pub block_number: u64,
pub cairo_pie_path: Option<String>,

If it's required, it shouldn't be optional, right?
Why does the proving job care about block_number?
cross_verify can be renamed to ensure_on_chain_registration or something more understandable.
Let's delete verification_key_path.


Make an input field which is an enum:

enum Input {
Pie()...
}


We can remove snos_fact and fold it into ensure_on_chain_registration using an Option.


Also add comments explaining which fields in metadata are inputs and which are added by jobs on the fly.


// State tracking
pub last_failed_block_no: Option<u64>,
pub tx_hashes: Vec<String>, // key: attempt_no, value: comma-separated tx hashes

Suggested change
pub tx_hashes: Vec<String>, // key: attempt_no, value: comma-separated tx hashes
pub tx_hashes: Vec<String>

block_numbers = block_numbers.into_iter().filter(|&block| block >= last_failed_block).collect::<Vec<u64>>();
}
// Filter block numbers if there was a previous failure
let block_numbers = if let JobSpecificMetadata::StateUpdate(state_metadata) = &job.metadata.specific {

we can move this part outside

.update_state_for_block(config.clone(), *block_no, snos, nonce)
.await
.map_err(|e| {
let snos = self.fetch_snos_for_block(i, config.clone(), job).await?;

Get the SNOS path here itself.

.await
.map_err(|e| {
let snos = self.fetch_snos_for_block(i, config.clone(), job).await?;
let txn_hash = match self.update_state_for_block(config.clone(), *block_no, i, snos, nonce, job).await {

same here

.unwrap();

// Update metadata in a single place
if let JobSpecificMetadata::StateUpdate(ref mut state_metadata) = job.metadata.specific {

here as well

Comment on lines 37 to 44
// Fetching the blob data (stored in remote storage during DA job) for a particular block
// pub async fn fetch_program_data_for_block(block_number: u64, config: Arc<Config>, job: &JobItem)
// -> color_eyre::Result<Vec<[u8; 32]>> { let storage_client = config.storage();
// let key = block_number.to_string() + "/" + PROGRAM_OUTPUT_FILE_NAME;
// let blob_data = storage_client.get_data(&key).await?;
// let transformed_blob_vec_u8 = bytes_to_vec_u8(blob_data.as_ref())?;
// Ok(transformed_blob_vec_u8)
// }

delete

Comment on lines 31 to 40
let proving_metadata = match &proving_job.metadata.specific {
JobSpecificMetadata::Proving(metadata) => metadata,
_ => {
tracing::error!(
job_id = %proving_job.internal_id,
"Invalid metadata type for proving job"
);
continue;
}
};

can just use internal id here

Ok(_) => tracing::info!(block_id = %job.internal_id, "Successfully created new proving job"),
for snos_job in successful_snos_jobs {
// Extract SNOS metadata
let snos_metadata = match &snos_job.metadata.specific {

We can try a TryInto implementation to make this cleaner.

@@ -96,26 +94,73 @@ impl Worker for UpdateStateWorker {
}
}
None => {
if blocks_to_process[0] != 0 {
if blocks_to_process[0] != 0 && blocks_to_process[0] != 1 {

need to remove this

@coveralls commented Jan 31, 2025

Coverage Status

coverage: 45.386% (-21.7%) from 67.086% when pulling 04e9324 on refactor/job-isolation into afb9afe on main.

@heemankv (Contributor) left a comment

Please feel free to make separate issues for comments you don't feel fit to resolve in this PR, and add the issue link in the respective comment.

Note: I have skipped reading the tests this time.

@@ -2,3 +2,10 @@ pub const BLOB_DATA_FILE_NAME: &str = "blob_data.txt";
pub const SNOS_OUTPUT_FILE_NAME: &str = "snos_output.json";
pub const PROGRAM_OUTPUT_FILE_NAME: &str = "program_output.txt";
pub const CAIRO_PIE_FILE_NAME: &str = "cairo_pie.zip";

pub const JOB_METADATA_CAIRO_PIE_PATH: &str = "cairo_pie_path";

+1

match cursor.try_next().await? {
Some(doc) => {
let job: JobItem = mongodb::bson::from_document(doc)?;

IMO we should use .map_err?

let job: JobItem = mongodb::bson::from_document(doc.clone()).map_err(|e| {
    // Clone above so the original document is still available to log on failure.
    tracing::error!(error = %e, document = ?doc, "Failed to deserialize document into JobItem");
    e
})?;

pub verification_retry_attempt_no: u64,
/// Timestamp when job processing completed
#[serde(with = "chrono::serde::ts_seconds_option")]
pub process_completed_at: Option<DateTime<Utc>>,

As discussed, let's add process_started_at and verification_started_at as well


We can leave these out for create jobs, since creation is almost always instantaneous and we usually don't care how long it takes.

);

// Get DA-specific metadata
let mut da_metadata: DaMetadata = job.metadata.specific.clone().try_into().map_err(|e| {

I am not sure about this idea of "specific" and "common" inside metadata.
What is your opinion on two fields within the job itself?
i.e. instead of metadata.specific and metadata.common, we could have commonMetaData and JobMetadata directly in the job.
Do you feel the current structure is necessary?

if current_blob_length > max_blob_per_txn {
tracing::warn!(job_id = ?job.id, current_blob_length = current_blob_length, max_blob_per_txn = max_blob_per_txn, "Exceeded maximum number of blobs per transaction");
tracing::warn!(

This should be an error only if we are throwing an error after it, isn't it?

Ok(transformed_blob_vec_u8)
// Get the path for this block
let path =
blob_data_paths.get(block_index).ok_or_else(|| eyre!("Blob data path not found for index {}", block_index))?;

What are we gaining by passing the index and the blob_data_paths down the props?
We are sending them

  • from process_job
  • to update_state_for_block
  • to fetch_program_data_for_block

and then just fetching the value at the index?

IMO it seems simpler to just fetch the block_number inside process_job and send here.


The same also applies to:

  • fetch_snos_for_block
  • fetch_program_output_for_block

let tx_inclusion_status =
settlement_client.verify_tx_inclusion(tx_hash).await.map_err(|e| JobError::Other(OtherError(e)))?;

match tx_inclusion_status {
SettlementVerificationStatus::Rejected(_) => {

Nitpick: why are we not using recursion here?! I just realised it, haha!
It could be crazy if we implemented it.

@@ -336,36 +412,44 @@ impl StateUpdateJob {
}

/// Retrieves the SNOS output for the corresponding block.
async fn fetch_snos_for_block(&self, block_no: u64, config: Arc<Config>) -> Result<StarknetOsOutput, JobError> {
async fn fetch_snos_for_block(

  • fetch_snos_for_block
  • fetch_program_output_for_block

These functions don't need to be part of this impl; they don't use self anywhere, so we can move them to utils.rs.

let fact = B256::from_str(fact).map_err(|e| ProverClientError::FailedToConvertFact(e.to_string()))?;
if self.fact_checker.is_valid(&fact).await? {
Ok(TaskStatus::Succeeded)
if cross_verify {

I think we can make these nested conditions better:

  • use a guard clause for when cross_verify is false
  • use a guard clause for when fact is None

Here's a reply from Claude to help:

if !cross_verify {
    tracing::debug!("Skipping cross-verification as it's disabled");
    return Ok(TaskStatus::Succeeded);
}

let Some(fact_str) = fact else {
    return Ok(TaskStatus::Failed("Cross verification enabled but no fact provided".to_string()));
};

let fact = B256::from_str(&fact_str)
    .map_err(|e| ProverClientError::FailedToConvertFact(e.to_string()))?;

tracing::debug!(fact = %hex::encode(&fact), "Cross-verifying fact on chain");

if self.fact_checker.is_valid(&fact).await? {
    Ok(TaskStatus::Succeeded)
} else {
    Ok(TaskStatus::Failed(format!(
        "Fact {} is not valid or not registered",
        hex::encode(fact)
    )))
}

"Cairo PIE task status: ONCHAIN and fact is valid."
);
CairoJobStatus::ONCHAIN => match fact {
Some(fact_str) => {

Same as above: we can use a guard clause instead of match fact.
