Determining Degrees of Separation between Marvel Superheroes with Breadth-first Search in PySpark RDD

GitHub License: MIT Made with Python Apache Spark


This project determines the degrees of separation between two Marvel superheroes based on their connections within the Marvel universe. Two superheroes are considered directly connected if they have ever appeared together in the same comic book. To find the degrees of separation, we use the Breadth-first Search (BFS) algorithm implemented in PySpark RDD. BFS explores the network level by level outward from the starting hero, so the first time it reaches the target hero, the number of levels traversed is the length of a shortest path between the two. The idea is illustrated with a simple example below:

[Figure: spark-BFS]
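The level-by-level search described above can be sketched in plain Python on a tiny made-up graph. This is only an illustration of the BFS idea, not the notebook's PySpark code; the function name `degrees_of_separation` and the sample graph are assumptions introduced for this example.

```python
from collections import deque


def degrees_of_separation(graph, start, target):
    """Return the number of hops on a shortest path, or None if unreachable.

    `graph` maps a hero ID to the IDs of directly connected heroes.
    """
    if start == target:
        return 0
    visited = {start}
    frontier = deque([(start, 0)])
    while frontier:
        hero, distance = frontier.popleft()
        for neighbour in graph.get(hero, ()):
            if neighbour == target:
                # BFS visits heroes in order of distance, so the first hit is shortest.
                return distance + 1
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append((neighbour, distance + 1))
    return None


# Made-up sample graph; the IDs do not correspond to the real dataset.
SAMPLE_GRAPH = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4], 6: []}

print(degrees_of_separation(SAMPLE_GRAPH, 1, 5))
```

Hero 6 has no connections in this sample, so a query involving it returns `None` rather than a distance.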


Features

  • Uses the Breadth-first Search (BFS) algorithm implemented in PySpark RDD to find the degrees of separation between two given superheroes.
  • Implements the BFS steps as user-defined functions that explore the network of superhero connections.
  • Provides an interactive function to search for superheroes by name or ID, making it easy to look up hero information.
  • Supports analysing any pair of superheroes to determine their degrees of separation.
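A name-or-ID lookup like the one listed above might look roughly like the following sketch. The function name `search_heroes`, its behaviour (case-insensitive substring match for names, exact match for numeric IDs), and the sample IDs are all assumptions for illustration; the notebook's own function may differ.

```python
def search_heroes(names, query):
    """Return a sorted list of (hero_id, name) pairs matching `query`.

    `names` maps hero IDs to names. A purely numeric query is treated as an ID;
    anything else is matched as a case-insensitive substring of the name.
    """
    query = str(query).strip()
    if query.isdigit():
        hero_id = int(query)
        return [(hero_id, names[hero_id])] if hero_id in names else []
    needle = query.lower()
    return sorted(
        (hero_id, name) for hero_id, name in names.items() if needle in name.lower()
    )


# Made-up IDs for illustration only; they are not taken from Marvel-names.txt.
SAMPLE_NAMES = {1: "SPIDER-MAN", 2: "IRON MAN", 3: "SPIDER-WOMAN"}

print(search_heroes(SAMPLE_NAMES, "spider"))
print(search_heroes(SAMPLE_NAMES, 2))
```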

Repository Structure

This repository consists of the following files:

Degrees of Separation with Breadth-first Search
├── Degrees of Separation with Breadth-first Search Algorithm.ipynb
├── data
│   ├── Marvel-graph.txt
│   └── Marvel-names.txt
├── README.md
├── APACHE LICENSE
└── MIT LICENSE

Degrees of Separation with Breadth-first Search Algorithm.ipynb: This Jupyter Notebook is the main Spark driver. It contains the PySpark RDD implementation of the BFS algorithm used to find the degrees of separation between heroes.
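One common way to express BFS over a distributed collection is as repeated map and reduce passes: a map step expands every frontier node into candidate distances for its neighbours, and a reduce step merges all records that share a hero ID. The sketch below imitates that pattern in plain Python so it runs without Spark; in PySpark the two steps would correspond to `flatMap` and `reduceByKey`. The node layout, state names, and function names are assumptions, and the notebook may structure its code differently.

```python
INF = float("inf")
# Ordering of states, used to keep the most advanced one when merging.
RANK = {"UNSEEN": 0, "FRONTIER": 1, "DONE": 2}


def expand(node):
    """Map step: emit the node itself plus a candidate record per neighbour."""
    hero_id, (neighbours, distance, state) = node
    emitted = []
    if state == "FRONTIER":
        for neighbour in neighbours:
            emitted.append((neighbour, ([], distance + 1, "FRONTIER")))
        state = "DONE"  # this node has now been fully explored
    emitted.append((hero_id, (neighbours, distance, state)))
    return emitted


def merge(a, b):
    """Reduce step: keep the adjacency list, smallest distance, furthest state."""
    neighbours = a[0] or b[0]
    distance = min(a[1], b[1])
    state = max(a[2], b[2], key=RANK.get)
    return (neighbours, distance, state)


def bfs_iteration(nodes):
    """One BFS pass: expand every node, then merge records with the same key."""
    merged = {}
    for node in nodes:
        for key, value in expand(node):
            merged[key] = merge(merged[key], value) if key in merged else value
    return list(merged.items())


# Made-up graph: hero 1 is the starting point at distance 0.
nodes = [
    (1, ([2, 3], 0, "FRONTIER")),
    (2, ([1, 4], INF, "UNSEEN")),
    (3, ([1], INF, "UNSEEN")),
    (4, ([2], INF, "UNSEEN")),
]
for _ in range(2):
    nodes = bfs_iteration(nodes)
distances = {hero_id: value[1] for hero_id, value in nodes}
print(distances)
```

Each pass advances the frontier by one level, so the number of passes needed before the target hero is first reached equals the degrees of separation.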

data/Marvel-graph.txt: Represents the network of Marvel superheroes based on their comic book appearances. It is a dataset containing heroIDs and their corresponding connections with other superheroes.
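As a rough sketch of how such a file could be loaded, the code below assumes each line holds whitespace-separated integers, with the first being a heroID and the rest its connections, and that a hero may appear on more than one line. That layout is an assumption, not something this README specifies, and the function names are hypothetical.

```python
def parse_graph_line(line):
    """Split one line into (hero_id, [connected hero IDs]).

    Assumes whitespace-separated integers with the heroID first.
    """
    fields = line.split()
    return int(fields[0]), [int(field) for field in fields[1:]]


def build_adjacency(lines):
    """Merge all lines into {hero_id: sorted list of unique connections}."""
    adjacency = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        hero_id, connections = parse_graph_line(line)
        adjacency.setdefault(hero_id, set()).update(connections)
    return {hero_id: sorted(conns) for hero_id, conns in adjacency.items()}


# Made-up lines in the assumed format; hero 1 appears twice and is merged.
sample_lines = ["1 2 3", "2 1", "", "1 4 2"]
print(build_adjacency(sample_lines))
```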

data/Marvel-names.txt: Provides a mapping between heroIDs and their corresponding superhero names. This dataset allows easy identification and reference to specific superheroes within the Marvel universe.
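A minimal sketch of reading this mapping follows, assuming each line has the form `heroID "NAME"` (an integer, a space, then the name in double quotes). The format and the function name `parse_name_line` are assumptions for illustration.

```python
def parse_name_line(line):
    """Split one line of the assumed form `heroID "NAME"` into (id, name)."""
    id_part, _, name_part = line.strip().partition(" ")
    return int(id_part), name_part.strip().strip('"')


# Made-up lines in the assumed format.
sample_lines = ['42 "EXAMPLE HERO/ALIAS"', '7 "ANOTHER HERO"']
names = dict(parse_name_line(line) for line in sample_lines)
print(names[42])
```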

README.md: The file you are currently reading. It provides an overview of the repository, the project description, usage instructions, and other relevant information.

APACHE LICENSE: The text of the Apache License, one of the two licenses that apply to the project.

MIT LICENSE: The text of the MIT License, the other license that applies to the project.


Usage

Note: As of July 2023, PySpark is not fully compatible with Python 3.11. It is highly recommended to use Python 3.7 or 3.8 for executing any PySpark-related operations.

To set up the environment for running the Spark code, follow these steps:

  1. Open Anaconda Prompt and execute the following command to create a conda environment powered by Python 3.8:

    conda create -n py38 python=3.8 anaconda
    
  2. Open Anaconda Navigator, go to Environments, and select the py38 environment to install the "pyspark" package.

  3. In the py38 environment folder (e.g. "C:\Users\[your-user-name]\anaconda3\envs\py38"), create a copy of python.exe and rename it to "python3.exe".

  4. Open the Jupyter Notebook application from your Anaconda Navigator. The notebooks you open should now run on Python 3.8. You can check the Python version by executing !python --version in a notebook code cell.

  5. Download this repository and open the Degrees of Separation with Breadth-first Search Algorithm.ipynb in your Jupyter Notebook.
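The version check from step 4 can also be done from Python itself. The sketch below compares the running interpreter against the versions recommended in the note above; the helper name `is_recommended_python` is hypothetical.

```python
import sys

# Versions recommended in the note above (as of July 2023).
RECOMMENDED = {(3, 7), (3, 8)}


def is_recommended_python(version_info=None):
    """Return True if the given (or current) interpreter is Python 3.7 or 3.8."""
    if version_info is None:
        version_info = sys.version_info
    return (version_info[0], version_info[1]) in RECOMMENDED


print(sys.version.split()[0], is_recommended_python())
```

Running this in a notebook cell prints the kernel's Python version and whether it falls within the recommended range.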


Contribution

Contributions to the project are welcome! If you find any issues or have suggestions for improvement, please feel free to open an issue or submit a pull request.


License

The project is licensed under both the MIT License and the Apache License.


Acknowledgement

I extend my gratitude to the Marvel universe and its creators for providing the rich dataset that makes this project possible. Additionally, I thank the PySpark and Apache Spark communities for their valuable contributions to the data processing and analysis ecosystem.