QichangZheng/MACS30123_Project

Q1 (Code)

Q234 (Code)

Q3(a)

Why I include these features (a PySpark construction sketch follows the list):

total_votes: the total number of votes a review has received. Reviews with more votes have attracted more reader engagement, so this feature captures the overall attention a review has drawn.

title_len: the length of the product title associated with each review. The title's length provides additional context about the product being reviewed and may influence how readers perceive or weigh the review.

helpful_prop: the proportion of helpful votes out of the total votes received for each review. A higher proportion suggests that other readers found the review informative or valuable, making this a direct signal of a good review.

verified_pur: whether the review comes from a verified purchase. Reviews from verified purchases are typically granted more credibility and trust, since they come from customers who actually bought the product, so this feature captures whether verified-purchase status helps distinguish good reviews.
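
A minimal sketch of how these features could be built in PySpark. The column names (product_title, helpful_votes, total_votes, verified_purchase) are assumptions based on the Amazon reviews schema and may differ from the repository's actual code:

```python
from pyspark.sql import functions as F

# Assumed input columns: product_title, helpful_votes, total_votes,
# verified_purchase ("Y"/"N"). Adjust to the real schema as needed.
reviews = (
    reviews
    # Length of the product title in characters.
    .withColumn("title_len", F.length(F.col("product_title")))
    # Share of votes that marked the review as helpful; guard against
    # division by zero for reviews with no votes.
    .withColumn(
        "helpful_prop",
        F.when(F.col("total_votes") > 0,
               F.col("helpful_votes") / F.col("total_votes")).otherwise(0.0),
    )
    # Binary flag for verified purchases.
    .withColumn(
        "verified_pur",
        F.when(F.col("verified_purchase") == "Y", 1).otherwise(0),
    )
)
```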

Q3(c)

In Spark, when we specify a series of transformations in a pipeline, the DataFrame is not processed immediately, because Spark follows a lazy evaluation model: it only computes results when an action is called (like count, collect, show, or fit in the provided code), not when transformations (like withColumn, sampleBy, or transform) are defined.

The operations we define are recorded in a Directed Acyclic Graph (DAG). When an action is triggered, Spark optimizes this DAG to determine the most efficient way to execute the operations, considering factors such as data locality and partitioning. This is part of what allows Spark to handle large-scale data processing efficiently.

In the given Spark code, the transformations are not actually executed until show is called on the sampled DataFrame and fit is called on the model in the Train function. persist marks the intermediate data for caching after each transformation (the cache is only populated once an action materializes it), which can speed up computation, especially when the data is reused in subsequent transformations or actions.
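
A minimal sketch of this behavior, with a hypothetical input path and column names (not the repository's actual pipeline):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("reviews.parquet")  # hypothetical input path

# Transformations: each call just extends the DAG; no data moves yet.
featurized = df.withColumn("title_len", F.length("product_title"))
sampled = featurized.sampleBy("label", fractions={0: 0.1, 1: 1.0}, seed=42)
sampled.persist()  # marks the result for caching; nothing is cached yet

# Action: only now does Spark optimize the DAG, run the job, and
# populate the cache for later reuse (e.g., by a subsequent fit).
sampled.show(5)
```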

Comparatively, Dask also operates on lazy evaluation and uses a similar task-scheduling approach. However, Dask's execution model is more dynamic and flexible than Spark's, which is more rigid and organized around stage-based map-shuffle-reduce plans. Dask builds a dynamic task graph that can adapt to complex and irregular computation patterns that don't fit neatly into that paradigm.

In the Dask structure, computations are not performed until compute is called, similar to an action in Spark. Once compute is called, Dask constructs the task graph, optimizes it, and schedules the tasks for execution, returning a Pandas DataFrame with the results.
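
The equivalent lazy pattern in Dask, again with a hypothetical path and column names:

```python
import dask.dataframe as dd

ddf = dd.read_parquet("reviews.parquet")  # hypothetical input path

# Lazy: these lines only build the task graph.
voted = ddf[ddf["total_votes"] > 0]
mean_helpful = (voted["helpful_votes"] / voted["total_votes"]).mean()

# compute() optimizes the graph, schedules the tasks, and returns a
# concrete result (a scalar here; a pandas DataFrame for table outputs).
result = mean_helpful.compute()
```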

In summary, both Spark and Dask follow a lazy evaluation model, building a task graph and executing it only when an action is triggered. The key difference lies in their scheduling and execution models: Dask is more dynamic and adaptable to irregular computations, while Spark adheres more strictly to its stage-based map-shuffle-reduce model.

Q4

For Label 0:

False Positive Rate: 47.14%. This means that out of all instances that were actually not label 0, the model incorrectly predicted them as label 0 around 47.14% of the time.

True Positive Rate (also known as Sensitivity or Recall): 76.64%. This means that out of all instances that were actually label 0, the model correctly predicted them as label 0 about 76.64% of the time.

For Label 1:

False Positive Rate: 23.36%. This means that out of all instances that were actually not label 1, the model incorrectly predicted them as label 1 about 23.36% of the time.

True Positive Rate: 52.86%. This means that out of all instances that were actually label 1, the model correctly predicted them as label 1 about 52.86% of the time.
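
Note that these four numbers are internally consistent: in a binary problem, the false positive rate for one label is the complement of the true positive rate for the other. A quick check with hypothetical confusion-matrix counts, chosen only to reproduce the stated rates (not taken from the actual model output):

```python
# Hypothetical counts (rows = actual, cols = predicted), scaled to 1,000
# instances per class purely to illustrate the rate definitions.
#            pred 0   pred 1
# actual 0     766      234
# actual 1     471      529
tp0, fn0 = 766, 234  # treating label 0 as the positive class
fp0, tn0 = 471, 529

tpr0 = tp0 / (tp0 + fn0)  # 0.766 -> TPR for label 0
fpr0 = fp0 / (fp0 + tn0)  # 0.471 -> FPR for label 0
tpr1 = tn0 / (tn0 + fp0)  # 0.529 -> TPR for label 1 = 1 - FPR for label 0
fpr1 = fn0 / (fn0 + tp0)  # 0.234 -> FPR for label 1 = 1 - TPR for label 0
```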

What the model does well:

It has a relatively high true positive rate for Label 0, meaning it does a decent job at correctly identifying instances of this class.

What the model does poorly:

The model has a high false positive rate for Label 0, meaning it often misclassifies instances as belonging to this class when they do not.

The model has a relatively low true positive rate for Label 1, suggesting that it often misses instances of this class.

Ways to improve the model:

To improve the low true positive rate for Label 1, we could explore feature engineering to create more relevant predictors, or try different modeling techniques that may be better suited to our specific problem. We could also tune the model hyperparameters to see if that improves performance. Lastly, it would be worth reviewing the loss function and evaluation metrics to ensure they are correctly incentivizing the model's predictions. For example, we may want to penalize false positives more heavily if they are particularly costly in our application.
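
One way to act on the weighting and tuning suggestions, sketched with PySpark's ML API. Here train is an assumed DataFrame of assembled features and labels, and the weight of 2.0 for label 1 and the grid values are placeholders, not tuned choices:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import functions as F

# Up-weight label 1 so that missing it costs more during training.
weighted = train.withColumn(
    "weight", F.when(F.col("label") == 1, 2.0).otherwise(1.0)
)

lr = LogisticRegression(featuresCol="features", labelCol="label",
                        weightCol="weight")

# Small grid over regularization settings; cross-validation picks the
# combination with the best held-out AUC.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)

model = cv.fit(weighted)  # an action: training actually runs here
```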
