Feature Selection and Classification on High-dimensional Brain Cancer Microarray Data

Microarray techniques can measure the expression of a large number of genes at once, but typically for only a limited number of samples. We study brain cancer because of its low incidence (6.3 per 100,000 men and women per year), and we use feature selection to find an optimal feature subset for multiclass classification.

Pipeline

Dataset: "Brain_GSE50161.csv"
Feature Selection:

  1. our pipeline with variance: "feature_selection_with_variance.ipynb"
    • input: "Brain_GSE50161.csv"
    • output: "df_w_var.csv"
  2. our pipeline without variance: "feature_selection_with_variance.ipynb"
    • input: "Brain_GSE50161.csv"
    • output: "df_wo_var.csv"
  3. LASSO: "feature_selection_with_lasso.ipynb"
    • input: "Brain_GSE50161.csv"
    • output: "df_lasso.csv"
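
As a rough illustration of the two kinds of selection listed above, here is a minimal sketch using scikit-learn. It is not the code from the notebooks: the synthetic matrix, the helper names `select_by_variance` and `select_by_l1`, and the threshold and `C` values are all assumptions made for the example, and an L1-penalized `LinearSVC` stands in for the LASSO step because the class labels are categorical.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for an expression matrix (assumption, not the real data):
# 60 samples, 200 "genes", 3 classes.
rng = np.random.default_rng(0)
n_per_class, n_features = 20, 200
y = np.repeat(np.arange(3), n_per_class)
X = rng.normal(size=(y.size, n_features))
# Columns 0-2 carry class signal: column k is shifted for class k.
for k in range(3):
    X[y == k, k] += 3.0
# Columns 150-199 are nearly constant, so a variance filter should drop them.
X[:, 150:] *= 0.01


def select_by_variance(X, threshold):
    """Keep only columns whose variance exceeds `threshold` (hypothetical helper)."""
    selector = VarianceThreshold(threshold=threshold)
    X_kept = selector.fit_transform(X)
    return X_kept, selector.get_support(indices=True)


def select_by_l1(X, y, C=0.1):
    """Keep columns with non-negligible weight in an L1-penalized linear model.

    Hypothetical helper: an L1-penalized LinearSVC is used here in place of
    LASSO regression, since the target is a class label.
    """
    X_scaled = StandardScaler().fit_transform(X)
    model = LinearSVC(penalty="l1", dual=False, C=C, max_iter=10000)
    selector = SelectFromModel(model).fit(X_scaled, y)
    return selector.get_support(indices=True)


X_var, var_idx = select_by_variance(X, threshold=0.1)
l1_idx = select_by_l1(X_var, y)

print("columns kept by the variance filter:", X_var.shape[1])
print("columns kept by the L1 model:", len(l1_idx))
# Map L1 selections back to column positions in the original matrix.
print("original column indices:", var_idx[l1_idx])
```

With the real data, `X` and `y` would instead be read from "Brain_GSE50161.csv", and the selected columns would be written to one of the output files listed above.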

Classification:

  1. Run multiclass classification on the dataset generated by any of the three feature selection notebooks: "Classification.ipynb"
    • input: "df_w_var.csv" or "df_wo_var.csv" or "df_lasso.csv"
    • output: accuracy, F1 score, confusion matrices
  2. Perform PCA and then run multiclass classification: "PCA.ipynb"
    • input: "Brain_GSE50161.csv"
    • output: accuracy, F1 score, confusion matrices
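
The two classification routes above, with and without PCA, can be sketched as follows. This is a minimal example under stated assumptions rather than the notebooks' code: the data is synthetic, the classifier (a linear `SVC`), the number of principal components, the split size, and the helper name `evaluate` are all choices made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a feature table (assumption, not the real data):
# 90 samples, 300 features, 3 classes.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 30)
X = rng.normal(size=(y.size, 300))
# Give each class its own block of 10 shifted columns.
for k in range(3):
    X[y == k, 10 * k:10 * (k + 1)] += 1.5


def evaluate(model, X, y, seed=0):
    """Fit on a stratified split and report the three outputs (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=30, stratify=y, random_state=seed
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }


# Route 1: classify directly on the (already selected) features.
direct = make_pipeline(StandardScaler(), SVC(kernel="linear"))
# Route 2: reduce to a few principal components first, then classify.
with_pca = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="linear"))

results = {"direct": evaluate(direct, X, y), "pca": evaluate(with_pca, X, y)}
for name, res in results.items():
    print(name, "accuracy:", res["accuracy"], "macro F1:", res["macro_f1"])
    print(res["confusion_matrix"])
```

Putting the scaler and PCA inside the pipeline means both are fitted on the training split only, which keeps the held-out samples from influencing the projection.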