diff --git a/notebooks/GenAI/AWS_Amazon_Q.ipynb b/notebooks/GenAI/AWS_Amazon_Q.ipynb
index 651ab11..76a0069 100644
--- a/notebooks/GenAI/AWS_Amazon_Q.ipynb
+++ b/notebooks/GenAI/AWS_Amazon_Q.ipynb
@@ -6,6 +6,8 @@
 "source": [
 "# Introduction to Amazon Q Developer\n",
 "\n",
+ "**Difficulty Level: Beginner**\n",
+ "\n",
 "### Overview\n",
 "\n",
 "Amazon Q Developer is a generative artificial intelligence (AI) powered conversational assistant designed to enhance the \n",
@@ -30,7 +32,7 @@
 "### Installing Amazon Q Developer on SageMaker Studio\n",
 "1. Navigate to SageMaker Studio and create a domain\n",
 "\n",
- "Note: In order to make Amazon Q available within the JupyterLab, you will need to modify the IAM permissions associated with the SageMaker ExecutionRole. You will find the domain name in your SageMaker Studio launch page. \n",
+ "Note: To make Amazon Q Developer available within JupyterLab, you will need to modify the IAM permissions associated with the SageMaker ExecutionRole. You will find the domain name in your SageMaker Studio launch page. \n",
 "\n",
 "\n",
 "2. Navigate to IAM and search for the execution role pertaining to your SageMaker domain. The execution role will be named AmazonSageMaker-ExecutionRole-(SageMaker domain id)\n",
@@ -38,9 +40,9 @@
 "\n",
 "4. Click on the role and scroll down to the role policies\n",
 "\n",
- "![alt text](https://github.com/STRIDES/NIHCloudLabAWS/blob/drafts/docs/images/Q-IAM-role.png)\n",
+ "![alt text](../../Q-IAM-role.png)\n",
 "\n",
- "![alt text](https://github.com/STRIDES/NIHCloudLabAWS/blob/drafts/docs/images/Q-role-policy.png)\n",
+ "![alt text](../../Q-role-policy.png)\n",
 "\n",
 "5. Add the following policy to the IAM role: \n",
 "\n",
@@ -66,19 +68,17 @@
 "```\n",
 "6. Review and save your changes\n",
 "\n",
- "![alt text](https://github.com/STRIDES/NIHCloudLabAWS/blob/drafts/docs/images/Q-iam-policy-review.png)\n",
+ "![alt text](../../Q-iam-policy-review.png)\n",
 "\n",
 "7. Open your SageMaker Studio UI and create a JupyterLab Space\n",
 "\n",
- "![alt text](https://github.com/STRIDES/NIHCloudLabAWS/blob/drafts/docs/images/Q-jupy-lab.png)\n",
+ "![alt text](../../Q-jupy-lab.png)\n",
 "\n",
 "8. Click on the Amazon Q logo on the left panel \n",
 "\n",
- "![alt text](https://github.com/STRIDES/NIHCloudLabAWS/blob/drafts/docs/images/Q-amazon-q-jup.png)\n",
- "\n",
- "Alternatively, you may open a CodeEditor application from the SageMaker Studio UI and install Amazon Q as an extension. Please note that when this application is shut down, the extension will be removed. \n",
+ "![alt text](../../Q-amazon-q-jup.png)\n",
 "\n",
- "### Let's get started! \n"
+ "Alternatively, you may open a Code Editor application from the SageMaker Studio UI and install Amazon Q Developer as an extension. Please note that when this application is shut down, the extension will be removed. "
 ]
 },
 {
@@ -97,8 +97,7 @@
 "wget https://raw.githubusercontent.com/STRIDES/NIHCloudLabAWS/refs/heads/drafts/notebooks/GenAI/example_scripts/bioinformatics_testing.py --no-check-certificate \n",
 "wget https://raw.githubusercontent.com/STRIDES/NIHCloudLabAWS/refs/heads/drafts/notebooks/GenAI/example_scripts/quick-actions-testing.ipynb --no-check-certificate \n",
 "\n",
- "```\n",
- "\n"
+ "```"
 ]
 },
 {
@@ -126,7 +125,7 @@
 "\n",
 "Try it out! Let's try adding a sixth step to the `bioinformatics_testing.py` script, where we will run samtools sort. As you type, press the tab key to see what the coding assistant suggests for you!\n",
 "\n",
- "![alt text](https://github.com/STRIDES/NIHCloudLabAWS/blob/drafts/docs/images/Q-code-completion-1.png.png)\n",
+ "![alt text](../../Q-code-completion-1.png.png)\n",
 "\n",
 "Samtools Sorting Example: \n",
 "```python\n",
@@ -152,174 +151,47 @@
 "The quick actions menu provides a list of ways that you may prompt the coding assistant. In this use case, we will test the `/fix`, `/optimize`, and `/explain` quick actions. \n",
 "\n",
 "##### **`/fix` Prompt**\n",
- "  1. Open the quick-actions-test.ipynb file\n",
- "  2. Run the notebook\n",
- "  3. Select the cell that contains and error - Cell 3\n",
- "  4. Navigate to the Amazon Q search bar and type in `/fix`\n",
- "  5. Click on the down arrow next to the send button and select \"Send message with selection\" \n",
- "\n",
- "##### **Response** \n",
- "I'll help fix this cell. The error occurs because `describe()` is a method that should be called on a DataFrame, but here it's being called directly and the DataFrame variable `df` is incorrectly placed after it.\n",
 "\n",
- "Here's the corrected code:\n",
+ "1. Open the quick-actions-testing.ipynb file\n",
+ "2. Run the notebook\n",
+ "3. Select the cell that contains an error - Cell 3\n",
+ "4. Navigate to the Amazon Q Developer search bar and type in `/fix`\n",
+ "5. Click on the down arrow next to the send button and select \"Send message with selection\" \n",
 "\n",
- "```python\n",
- "# View summary statistics\n",
- "df.describe()\n",
- "```\n",
+ "![alt text](../../Q-send-cell-with-prompt.png)\n",
 "\n",
- "This will generate summary statistics including:\n",
- "- count\n",
- "- mean\n",
- "- std (standard deviation)\n",
- "- min\n",
- "- 25% (first quartile)\n",
- "- 50% (median)\n",
- "- 75% (third quartile)\n",
- "- max\n",
- "\n",
- "For more detailed statistics, you could also specify which statistics to include:\n",
- "```python\n",
- "df.describe(include='all', percentiles=[.05, .25, .5, .75, .95])\n",
- "```\n",
+ "##### **Response** \n",
 "\n",
- "Let me know if you need any clarification!\n",
+ "![alt text](../../Q-fix.png)\n",
 "\n",
 "##### **Response Breakdown**\n",
 "* The response contains the corrected code, a description of what the code does, and suggestions for improving it. \n",
 "* You can easily implement the suggested changes by clicking on the three dots at the top of the response and selecting \"Replace selection\"\n",
 "\n",
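+ "Since the response now lives in a screenshot, here is the corrected cell for reference (a sketch; it assumes the notebook's pandas DataFrame is named `df`, as in quick-actions-testing.ipynb):\n",
+ "\n",
+ "```python\n",
+ "# View summary statistics by calling describe() on the DataFrame\n",
+ "df.describe()\n",
+ "```\n",
+ "\n",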
 "##### **`/optimize` Prompt**\n",
- "  1. Select the cell in which data is added to the dataframe - Cell 4\n",
- "  2. Navigate to the Amazon Q search bar and type in `/optimize`\n",
- "  3. Click on the down arrow next to the send button and select \"Send message with selection\" \n",
 "\n",
- "##### **Response**\n",
- "\n",
- "I'll optimize the code for adding multiple columns to the DataFrame more efficiently. Here's the improved version:\n",
- "\n",
- "```python\n",
- "# Add multiple columns efficiently using a dictionary and assign()\n",
- "new_data = {\n",
- "    'Sample_Type': ['Tissue1', 'Tissue2', 'Tissue1', 'Tissue2'],\n",
- "    'P_Value': [0.05, 0.01, 0.03, 0.07]\n",
- "}\n",
- "df = df.assign(**new_data)\n",
- "\n",
- "# Optional: Verify the data types are optimal\n",
- "df = df.astype({\n",
- "    'Sample_Type': 'category',  # Convert to category type for memory efficiency\n",
- "    'P_Value': 'float32'  # Use float32 instead of float64 if precision allows\n",
- "})\n",
- "```\n",
- "\n",
- "Key optimizations:\n",
- "1. Uses `assign()` for a single operation instead of multiple assignments\n",
- "2. Converts string column to category type for memory efficiency\n",
- "3. Uses appropriate data type for numerical values\n",
- "4. Reduces DataFrame copies from 2 to 1\n",
+ "1. Select the cell in which data is added to the DataFrame - Cell 4\n",
+ "2. Navigate to the Amazon Q Developer search bar and type in `/optimize`\n",
+ "3. Click on the down arrow next to the send button and select \"Send message with selection\" \n",
 "\n",
- "Alternative method if you prefer updating in place:\n",
- "```python\n",
- "# Alternative optimization using update\n",
- "df.update(pd.DataFrame(new_data, index=df.index))\n",
- "```\n",
+ "##### **Response**\n",
 "\n",
- "These optimizations are particularly beneficial for:\n",
- "- Large datasets\n",
- "- Memory management\n",
- "- Performance\n",
- "- Code maintainability\n",
+ "![alt text](../../Q-optimize.png)\n",
 "\n",
 "##### **Response Breakdown** \n",
 "A few methods of optimization are suggested. \n",
 "\n",
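+ "For readers who cannot view the screenshot, the heart of the suggestion is to batch the column additions into a single `assign()` call (a sketch; `df` and the values follow the example notebook):\n",
+ "\n",
+ "```python\n",
+ "# Add both columns in one operation instead of assigning them one at a time\n",
+ "new_data = {\n",
+ "    'Sample_Type': ['Tissue1', 'Tissue2', 'Tissue1', 'Tissue2'],\n",
+ "    'P_Value': [0.05, 0.01, 0.03, 0.07]\n",
+ "}\n",
+ "df = df.assign(**new_data)\n",
+ "```\n",
+ "\n",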
 "##### **`/explain` Prompt**\n",
- "  1. Select the cell in which matplotlib is used to create a plot from the dataframe - Cell 5\n",
- "  2. Navigate to the Amazon Q search bar and type in `/explain`\n",
- "  3. Click on the down arrow next to the send button and select \"Send message with selection\" \n",
+ "1. Select the cell in which matplotlib is used to create a plot from the DataFrame - Cell 5\n",
+ "2. Navigate to the Amazon Q Developer search bar and type in `/explain`\n",
+ "3. Click on the down arrow next to the send button and select \"Send message with selection\" \n",
 "\n",
 "##### **Response** \n",
- "I'll explain this code block line by line and suggest some enhancements:\n",
- "\n",
- "```python\n",
- "# Import matplotlib's pyplot module for plotting\n",
- "import matplotlib.pyplot as plt\n",
- "\n",
- "# Create a line plot using pandas DataFrame plotting interface\n",
- "df.plot(\n",
- "    x='Sample_ID',          # X-axis: Sample identifiers\n",
- "    y='Expression_Level',   # Y-axis: Expression level values\n",
- "    kind='line'             # Type of plot: line graph\n",
- ")\n",
- "\n",
- "# Display the plot\n",
- "plt.show()\n",
- "```\n",
- "\n",
- "Let's enhance this visualization with better formatting and features:\n",
- "\n",
- "```python\n",
- "# Create an enhanced line plot\n",
- "plt.figure(figsize=(10, 6))  # Set figure size for better visibility\n",
- "\n",
- "# Create the plot with additional parameters\n",
- "df.plot(\n",
- "    x='Sample_ID', \n",
- "    y='Expression_Level',\n",
- "    kind='line',\n",
- "    marker='o',           # Add data point markers\n",
- "    linestyle='-',        # Solid line style\n",
- "    linewidth=2,          # Thicker line\n",
- "    color='steelblue',    # Professional color\n",
- "    title='Gene Expression Levels Across Samples'\n",
- ")\n",
- "\n",
- "# Customize the plot\n",
- "plt.xlabel('Sample ID', fontsize=12)\n",
- "plt.ylabel('Expression Level', fontsize=12)\n",
- "plt.grid(True, alpha=0.3)  # Add subtle grid\n",
- "plt.xticks(rotation=45)  # Rotate x-axis labels for better readability\n",
- "\n",
- "# Add mean expression line\n",
- "plt.axhline(\n",
- "    y=df['Expression_Level'].mean(),\n",
- "    color='red',\n",
- "    linestyle='--',\n",
- "    alpha=0.5,\n",
- "    label='Mean Expression'\n",
- ")\n",
- "\n",
- "plt.legend()\n",
- "plt.tight_layout()  # Adjust layout to prevent label cutoff\n",
- "plt.show()\n",
- "```\n",
- "\n",
- "Key components explained:\n",
- "1. **Data Visualization**:\n",
- "   - Shows expression levels (y-axis) for each sample (x-axis)\n",
- "   - Line connects points to show trends\n",
- "   - Markers highlight actual data points\n",
- "\n",
- "2. **Formatting**:\n",
- "   - Figure size set for clarity\n",
- "   - Axis labels and title added\n",
- "   - Grid helps read values\n",
- "   - Rotated x-axis labels prevent overlap\n",
- "\n",
- "3. **Statistical Context**:\n",
- "   - Mean line adds reference\n",
- "   - Legend identifies plot elements\n",
- "\n",
- "This visualization helps:\n",
- "- Track expression changes across samples\n",
- "- Identify patterns or trends\n",
- "- Spot potential outliers\n",
- "- Compare expression levels between samples\n",
- "- Reference against mean expression\n",
+ "\n",
+ "![alt text](../../Q-explain.png)\n",
 "\n",
 "##### **Response Breakdown** \n",
- "* The functions present in the code snippet are explained. Additionally, Amazon Q provides suggestions to enhance the code"
+ "* The functions present in the code snippet are explained. Additionally, suggestions to enhance the code are provided.\n",
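+ "\n",
+ "If you would like to experiment beyond the screenshot, the cell being explained reduces to a pandas line plot (a sketch; the column names follow the example notebook):\n",
+ "\n",
+ "```python\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "# Line plot of expression level per sample via the DataFrame plotting API\n",
+ "df.plot(x='Sample_ID', y='Expression_Level', kind='line')\n",
+ "plt.show()\n",
+ "```"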
 ]
 },
 {
@@ -334,7 +206,7 @@
 "\n",
 "Prompting can be used to modify an existing script. We will incorporate parallel processing into a script. Parallel processing allows a program to execute multiple tasks simultaneously, which can significantly speed up the execution time, especially for tasks that are computationally intensive. \n",
 "\n",
- "Note: In this prompt, we include the script. When utilizing Amazon Q in other IDEs such as VSCode or Code Editor, you may reference files through their filepaths. \n",
+ "Note: In this prompt, we include the script. When utilizing Amazon Q Developer in other IDEs such as VS Code or Code Editor, you may reference files through their filepaths. \n",
 "\n",
 "##### **Prompt** \n",
 "\n",
@@ -376,126 +248,7 @@
 "source": [
 "##### **Response**\n",
 "\n",
- "I'll modify your script to use parallel processing for the FastQC and STAR alignment steps, as these are independent operations that can be run in parallel. 
Here's the modified version:\n", - "\n", - "```python\n", - "import pandas as pd\n", - "import subprocess\n", - "import multiprocessing as mp\n", - "from functools import partial\n", - "import os\n", - "\n", - "def run_fastqc(row):\n", - " \"\"\"Run FastQC for a single sample\"\"\"\n", - " fastqc_command = f\"fastqc {row['file_path']} -o ./fastqc_results/\"\n", - " try:\n", - " subprocess.run(fastqc_command, shell=True, check=True)\n", - " return f\"FastQC completed for {row['file_path']}\"\n", - " except subprocess.CalledProcessError as e:\n", - " return f\"Error running FastQC for {row['file_path']}: {str(e)}\"\n", - "\n", - "def run_star(row):\n", - " \"\"\"Run STAR alignment for a single sample\"\"\"\n", - " star_command = f\"STAR --genomeDir /path/to/genome --readFilesIn {row['file_path']} --outFileNamePrefix ./star_results/{row['sample_id']}\"\n", - " try:\n", - " subprocess.run(star_command, shell=True, check=True)\n", - " return f\"STAR alignment completed for {row['sample_id']}\"\n", - " except subprocess.CalledProcessError as e:\n", - " return f\"Error running STAR for {row['sample_id']}: {str(e)}\"\n", - "\n", - "def run_samtools_index(row):\n", - " \"\"\"Index BAM file for a single sample\"\"\"\n", - " bam_file = f\"./star_results/{row['sample_id']}.bam\"\n", - " samtools_command = f\"samtools index {bam_file}\"\n", - " try:\n", - " subprocess.run(samtools_command, shell=True, check=True)\n", - " return f\"Samtools indexing completed for {row['sample_id']}\"\n", - " except subprocess.CalledProcessError as e:\n", - " return f\"Error running Samtools for {row['sample_id']}: {str(e)}\"\n", - "\n", - "def parallel_process(function, data, num_processes=None):\n", - " \"\"\"Generic parallel processing function\"\"\"\n", - " if num_processes is None:\n", - " num_processes = max(1, mp.cpu_count() - 1) # Leave one CPU free\n", - " \n", - " with mp.Pool(processes=num_processes) as pool:\n", - "\n", - " return results\n", - "\n", - "if __name__ == \"__main__\":\n", - " # Create output directories if they don't exist\n", - " os.makedirs(\"./fastqc_results/\", exist_ok=True)\n", - " os.makedirs(\"./star_results/\", exist_ok=True)\n", - " os.makedirs(\"./multiqc_report/\", exist_ok=True)\n", - "\n", - " # Step 1: Read the sample sheet\n", - " sample_sheet = pd.read_csv('samplesheet.csv')\n", - "\n", - " # Step 2: Run FastQC in parallel\n", - " print(\"Running FastQC in parallel...\")\n", - " fastqc_results = parallel_process(run_fastqc, sample_sheet)\n", - " for result in fastqc_results:\n", - " print(result)\n", - "\n", - " # Step 3: Run MultiQC (this needs to run after all FastQC jobs are complete)\n", - " print(\"\\nRunning MultiQC...\")\n", - " multiqc_command = \"multiqc ./fastqc_results/ -o ./multiqc_report/\"\n", - " try:\n", - " subprocess.run(multiqc_command, shell=True, check=True)\n", - " print(\"MultiQC completed successfully\")\n", - " except subprocess.CalledProcessError as e:\n", - " print(f\"Error running MultiQC: {str(e)}\")\n", - "\n", - " # Step 4: Run STAR aligner in parallel\n", - " print(\"\\nRunning STAR alignment in parallel...\")\n", - " star_results = parallel_process(run_star, sample_sheet)\n", - " for result in star_results:\n", - " print(result)\n", - "\n", - " # Step 5: Index BAM files with Samtools in parallel\n", - " print(\"\\nIndexing BAM files in parallel...\")\n", - " samtools_results = parallel_process(run_samtools_index, sample_sheet)\n", - " for result in samtools_results:\n", - " print(result)\n", - "```\n", - "\n", - "Key improvements in this 
version:\n", - "\n", - "1. **Parallel Processing**: Uses Python's multiprocessing to run FastQC, STAR alignment, and Samtools indexing in parallel.\n", - "\n", - "2. **Error Handling**: Added try-except blocks to catch and report errors for each process.\n", - "\n", - "3. **Progress Reporting**: Added status messages to track progress.\n", - "\n", - "4. **Resource Management**: \n", - " - Automatically determines optimal number of processes based on CPU cores\n", - " - Uses context managers (`with` statements) for proper resource cleanup\n", - "\n", - "5. **Directory Creation**: Automatically creates output directories if they don't exist.\n", - "\n", - "To customize the parallel processing:\n", - "```python\n", - "# You can specify the number of processes explicitly:\n", - "fastqc_results = parallel_process(run_fastqc, sample_sheet, num_processes=4)\n", - "```\n", - "\n", - "Important notes:\n", - "\n", - "1. The script will use (number of CPU cores - 1) processes by default. You can adjust this by specifying `num_processes`.\n", - "\n", - "2. Memory usage will increase with the number of parallel processes. Monitor system resources and adjust accordingly.\n", - "\n", - "3. Make sure your system has enough resources to handle parallel processing of your data.\n", - "\n", - "4. Some tools (like STAR) might have their own parallel processing capabilities. You might need to adjust their thread settings to avoid oversubscription.\n", - "\n", - "5. The MultiQC step remains sequential as it needs to process all FastQC results at once.\n", - "\n", - "To optimize performance, you might want to:\n", - "- Adjust the number of processes based on your system's capabilities\n", - "- Consider the I/O limitations of your storage system\n", - "- Monitor memory usage and adjust accordingly\n", - "- Consider using tool-specific parallelization options (e.g., STAR's built-in threading)\n" + "![alt text](../../Q-parallel-processing.png)" ] }, { @@ -558,225 +311,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "##### **Response** \n", - "Here's an optimized version of your script with several improvements: [[1]](https://stackoverflow.com/questions/48773816)\n", - "\n", - "```python\n", - "import pandas as pd\n", - "import subprocess\n", - "import multiprocessing as mp\n", - "import os\n", - "import logging\n", - "from datetime import datetime\n", - "from pathlib import Path\n", - "import sys\n", - "from typing import List, Dict\n", - "\n", - "# Configure logging\n", - "def setup_logging(log_dir: str = \"logs\") -> None:\n", - " \"\"\"Set up logging configuration\"\"\"\n", - " Path(log_dir).mkdir(parents=True, exist_ok=True)\n", - " timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n", - " log_file = f\"{log_dir}/pipeline_{timestamp}.log\"\n", - " \n", - " logging.basicConfig(\n", - " level=logging.INFO,\n", - " format='%(asctime)s - %(levelname)s - %(message)s',\n", - " handlers=[\n", - " logging.FileHandler(log_file),\n", - " logging.StreamHandler(sys.stdout)\n", - " ]\n", - " )\n", - "\n", - "def create_directories(dirs: List[str]) -> None:\n", - " \"\"\"Create necessary directories if they don't exist\"\"\"\n", - " for dir_path in dirs:\n", - " Path(dir_path).mkdir(parents=True, exist_ok=True)\n", - " logging.info(f\"Created directory: {dir_path}\")\n", - "\n", - "def run_command(cmd: str, description: str) -> bool:\n", - " \"\"\"Execute a shell command and handle errors\"\"\"\n", - " try:\n", - " logging.info(f\"Starting {description}\")\n", - " result = subprocess.run(\n", - " cmd,\n", - 
" shell=True,\n", - " check=True,\n", - " capture_output=True,\n", - " text=True\n", - " )\n", - " logging.info(f\"Completed {description}\")\n", - " return True\n", - " except subprocess.CalledProcessError as e:\n", - " logging.error(f\"Error in {description}: {e}\")\n", - " logging.error(f\"Command output: {e.output}\")\n", - " return False\n", - "\n", - "def process_sample(row: pd.Series, config: Dict) -> Dict:\n", - " \"\"\"Process a single sample through the pipeline\"\"\"\n", - " results = {\"sample_id\": row['sample_id'], \"success\": True, \"errors\": []}\n", - " \n", - " # Run FastQC\n", - " fastqc_cmd = f\"fastqc {row['file_path']} -t {config['threads_per_sample']} -o {config['fastqc_dir']}\"\n", - " if not run_command(fastqc_cmd, f\"FastQC for {row['sample_id']}\"):\n", - " results[\"success\"] = False\n", - " results[\"errors\"].append(\"FastQC failed\")\n", - " \n", - " # Run STAR\n", - " star_cmd = (f\"STAR --genomeDir {config['genome_dir']} \"\n", - " f\"--readFilesIn {row['file_path']} \"\n", - " f\"--runThreadN {config['threads_per_sample']} \"\n", - " f\"--outFileNamePrefix {config['star_dir']}/{row['sample_id']}\")\n", - " if not run_command(star_cmd, f\"STAR alignment for {row['sample_id']}\"):\n", - " results[\"success\"] = False\n", - " results[\"errors\"].append(\"STAR alignment failed\")\n", - " \n", - " # Run Samtools\n", - " bam_file = f\"{config['star_dir']}/{row['sample_id']}.bam\"\n", - " samtools_cmd = f\"samtools index {bam_file}\"\n", - " if not run_command(samtools_cmd, f\"Samtools indexing for {row['sample_id']}\"):\n", - " results[\"success\"] = False\n", - " results[\"errors\"].append(\"Samtools indexing failed\")\n", - " \n", - " return results\n", - "\n", - "def main():\n", - " # Configuration\n", - " config = {\n", - " \"fastqc_dir\": \"./fastqc_results\",\n", - " \"multiqc_dir\": \"./multiqc_report\",\n", - " \"star_dir\": \"./star_results\",\n", - " \"genome_dir\": \"/path/to/genome\",\n", - " \"log_dir\": \"./logs\",\n", - " \"threads_per_sample\": 4,\n", - " \"max_parallel_samples\": max(1, mp.cpu_count() - 1)\n", - " }\n", - " \n", - " # Setup\n", - " setup_logging(config['log_dir'])\n", - " create_directories([\n", - " config['fastqc_dir'],\n", - " config['multiqc_dir'],\n", - " config['star_dir'],\n", - " config['log_dir']\n", - " ])\n", - " \n", - " try:\n", - " # Read sample sheet\n", - " logging.info(\"Reading sample sheet\")\n", - " sample_sheet = pd.read_csv('samplesheet.csv')\n", - " \n", - " # Process samples in parallel\n", - " with mp.Pool(processes=config['max_parallel_samples']) as pool:\n", - " results = pool.starmap(\n", - " process_sample,\n", - " [(row, config) for _, row in sample_sheet.iterrows()]\n", - " )\n", - " \n", - " # Run MultiQC\n", - " multiqc_cmd = f\"multiqc {config['fastqc_dir']} -o {config['multiqc_dir']}\"\n", - " run_command(multiqc_cmd, \"MultiQC\")\n", - " \n", - " # Report summary\n", - " successful = sum(1 for r in results if r['success'])\n", - " logging.info(f\"Pipeline completed. \"\n", - " f\"Successful: {successful}/{len(results)} samples\")\n", - " \n", - " # Report errors if any\n", - " for result in results:\n", - " if not result['success']:\n", - " logging.error(f\"Sample {result['sample_id']} failed: \"\n", - " f\"{', '.join(result['errors'])}\")\n", - " \n", - " except Exception as e:\n", - " logging.error(f\"Pipeline failed: {str(e)}\")\n", - " raise\n", - "\n", - "if __name__ == \"__main__\":\n", - " main()\n", - "```\n", - "\n", - "Key optimizations and improvements:\n", - "\n", - "1. 
**Parallel Processing**:\n", - " - Uses multiprocessing to run samples in parallel\n", - " - Configurable number of parallel processes\n", - " - Efficient resource utilization\n", - "\n", - "2. **Error Handling**:\n", - " - Comprehensive error catching and logging\n", - " - Detailed error messages\n", - " - Continues processing other samples if one fails\n", - "\n", - "3. **Logging System**:\n", - " - Detailed logging with timestamps\n", - " - Both console and file output\n", - " - Organized log directory structure\n", - "\n", - "4. **Code Organization**:\n", - " - Modular functions for better maintainability\n", - " - Type hints for better code clarity\n", - " - Clear separation of concerns\n", - "\n", - "5. **Resource Management**:\n", - " - Configurable threads per sample\n", - " - Automatic CPU core detection\n", - " - Directory creation handling\n", - "\n", - "6. **Progress Tracking**:\n", - " - Status updates for each step\n", - " - Summary report at completion\n", - " - Error reporting for failed samples\n", - "\n", - "To further optimize the script, you could:\n", - "\n", - "1. Add configuration file support:\n", - "```python\n", - "import yaml\n", - "\n", - "def load_config(config_file: str) -> Dict:\n", - " with open(config_file) as f:\n", - " return yaml.safe_load(f)\n", - "```\n", - "\n", - "2. Add checkpointing to resume interrupted runs:\n", - "```python\n", - "def check_completion(sample_id: str, output_dir: str) -> bool:\n", - " \"\"\"Check if sample has already been processed\"\"\"\n", - " return Path(f\"{output_dir}/{sample_id}.bam.bai\").exists()\n", - "```\n", - "\n", - "3. Add memory management:\n", - "```python\n", - "def get_memory_usage() -> float:\n", - " \"\"\"Monitor memory usage\"\"\"\n", - " import psutil\n", - " return psutil.Process().memory_info().rss / 1024 / 1024 # MB\n", - "```\n", - "\n", - "4. Add performance monitoring:\n", - "```python\n", - "from time import time\n", - "\n", - "def timer_decorator(func):\n", - " def wrapper(*args, **kwargs):\n", - " start = time()\n", - " result = func(*args, **kwargs)\n", - " end = time()\n", - " logging.info(f\"{func.__name__} took {end - start:.2f} seconds\")\n", - " return result\n", - " return wrapper\n", - "```\n", - "\n", - "5. Add input validation:\n", - "```python\n", - "def validate_sample_sheet(df: pd.DataFrame) -> bool:\n", - " \"\"\"Validate sample sheet format and contents\"\"\"\n", - " required_columns = ['sample_id', 'file_path']\n", - " return all(col in df.columns for col in required_columns)\n", - "```\n", + "##### **Response**\n", "\n", - "These optimizations make the script more robust, efficient, and maintainable while providing better monitoring and error handling capabilities." + "![alt text](../../Q-optimize-script.png)" ] }, { @@ -809,7 +346,7 @@ "\n", "##### **Prompt** \n", "\n", - "Convert the `starAlign.slurm` script into a Snakemake workflow.\n", + "Convert this `starAlign.slurm` script into a Snakemake workflow.\n", "Script: \n", "\n", "```bash\n", @@ -838,185 +375,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "##### **Response** \n", - "\n", - "I'll convert your SLURM script into a Snakemake workflow. 
Here's the implementation with a Snakefile and configuration file: [[1]](https://stackoverflow.com/questions/77929511)\n", - "\n", - "```python\n", - "# Snakefile\n", - "\n", - "# Configuration\n", - "configfile: \"config.yaml\"\n", - "\n", - "# Wildcard constraints\n", - "wildcard_constraints:\n", - " sample=\"[A-Za-z0-9_-]+\"\n", - "\n", - "# Rules\n", - "rule all:\n", - " input:\n", - " expand(\"{output_dir}/{sample}_Aligned.sortedByCoord.out.bam\",\n", - " output_dir=config[\"output_dir\"],\n", - " sample=config[\"samples\"])\n", - "\n", - "rule star_align:\n", - " input:\n", - " r1 = \"{input_dir}/{sample}_R1.fastq\",\n", - " r2 = \"{input_dir}/{sample}_R2.fastq\"\n", - " output:\n", - " bam = \"{output_dir}/{sample}_Aligned.sortedByCoord.out.bam\"\n", - " params:\n", - " genome_dir = config[\"genome_dir\"],\n", - " out_prefix = \"{output_dir}/{sample}_\"\n", - " threads: 8\n", - " resources:\n", - " mem_mb = 32000,\n", - " time = \"02:00:00\"\n", - " log:\n", - " \"logs/star/{sample}.log\"\n", - " shell:\n", - " \"\"\"\n", - " STAR --genomeDir {params.genome_dir} \\\n", - " --readFilesIn {input.r1} {input.r2} \\\n", - " --outFileNamePrefix {params.out_prefix} \\\n", - " --runThreadN {threads} \\\n", - " --outSAMtype BAM SortedByCoordinate \\\n", - " 2> {log}\n", - " \"\"\"\n", - "```\n", - "\n", - "And here's the corresponding configuration file:\n", - "\n", - "```yaml\n", - "# config.yaml\n", - "\n", - "# Directories\n", - "input_dir: \"/path/to/input\"\n", - "output_dir: \"/path/to/output\"\n", - "genome_dir: \"/path/to/genome\"\n", - "\n", - "# Sample names (without _R1/R2.fastq suffix)\n", - "samples:\n", - " - \"sample1\"\n", - " - \"sample2\"\n", - " - \"sample3\"\n", - "```\n", - "\n", - "To run this workflow with SLURM, create a cluster configuration file:\n", - "\n", - "```yaml\n", - "# cluster.yaml\n", - "\n", - "__default__:\n", - " partition: \"default\"\n", - " output: \"logs/slurm/slurm-%j.out\"\n", - " error: \"logs/slurm/slurm-%j.err\"\n", - " nodes: 1\n", - " ntasks: 1\n", - "\n", - "star_align:\n", - " cpus-per-task: 8\n", - " mem: \"32G\"\n", - " time: \"02:00:00\"\n", - "```\n", - "\n", - "You can also create a submission script:\n", - "\n", - "```bash\n", - "#!/bin/bash\n", - "# submit.sh\n", - "\n", - "snakemake \\\n", - " --profile slurm \\\n", - " --jobs 20 \\\n", - " --cluster-config cluster.yaml \\\n", - " --use-conda \\\n", - " --conda-frontend mamba\n", - "```\n", - "\n", - "Key features of this implementation:\n", - "\n", - "1. **Modular Configuration**:\n", - " - Separate config files for workflow, cluster, and samples\n", - " - Easy to modify parameters without changing the workflow\n", - "\n", - "2. **Resource Management**:\n", - " - Specifies memory and time requirements\n", - " - Uses threading appropriately\n", - " - Configurable through cluster.yaml\n", - "\n", - "3. **Logging**:\n", - " - Separate log files for each sample\n", - " - SLURM output and error logs\n", - "\n", - "4. **Scalability**:\n", - " - Handles multiple samples\n", - " - Easy to add more samples in config.yaml\n", - "\n", - "To use this workflow:\n", - "\n", - "1. Create the directory structure:\n", - "```bash\n", - "mkdir -p logs/star logs/slurm\n", - "```\n", - "\n", - "2. Modify the config files:\n", - " - Update paths in config.yaml\n", - " - Add your sample names\n", - " - Adjust resources in cluster.yaml if needed\n", - "\n", - "3. 
Run the workflow:\n", - "```bash\n", - "# With SLURM profile\n", - "snakemake --profile slurm\n", - "\n", - "# Or with direct SLURM submission\n", - "snakemake \\\n", - " --cluster \"sbatch -p {cluster.partition} -t {cluster.time} \\\n", - " --mem {cluster.mem} -c {cluster.cpus-per-task}\" \\\n", - " --cluster-config cluster.yaml \\\n", - " --jobs 20\n", - "```\n", - "\n", - "Additional tips:\n", - "\n", - "1. Add more rules for quality control or downstream analysis:\n", - "```python\n", - "rule fastqc:\n", - " input:\n", - " \"{input_dir}/{sample}_{read}.fastq\"\n", - " output:\n", - " html = \"qc/fastqc/{sample}_{read}_fastqc.html\",\n", - " zip = \"qc/fastqc/{sample}_{read}_fastqc.zip\"\n", - " threads: 1\n", - " resources:\n", - " mem_mb = 2000\n", - " shell:\n", - " \"fastqc -o qc/fastqc {input}\"\n", - "```\n", - "\n", - "2. Add conda environment specification:\n", - "```yaml\n", - "# env.yaml\n", - "channels:\n", - " - bioconda\n", - " - conda-forge\n", - "dependencies:\n", - " - star=2.7.10a\n", - " - fastqc=0.11.9\n", - "```\n", + "##### **Response**\n", "\n", - "3. Include quality control in the workflow:\n", - "```python\n", - "rule all:\n", - " input:\n", - " expand(\"qc/fastqc/{sample}_{read}_fastqc.html\",\n", - " sample=config[\"samples\"],\n", - " read=[\"R1\", \"R2\"]),\n", - " expand(\"{output_dir}/{sample}_Aligned.sortedByCoord.out.bam\",\n", - " output_dir=config[\"output_dir\"],\n", - " sample=config[\"samples\"])\n", - "```\n" + "![alt text](../../Q-snakemake-wf.png)" ] }, { @@ -1048,7 +409,7 @@ "source": [ "#### Use Case 5: Cloud migration assistance. \n", "\n", - "Cloud migration of bioinformatics pipelines involves moving data and computational workflows to the cloud. This allows researchers to use scalable and powerful cloud resources, making it easier to process large datasets and perform complex analyses efficiently and cost-effectively. Prompting can be used to facilitate the migration of pipelines to the cloud. As Amazon Q specializes in queries and tasks related to AWS, let's prompt the model to facilitate migration to AWS infrastructure. " + "Cloud migration of bioinformatics pipelines involves moving data and computational workflows to the cloud. This allows researchers to use scalable and powerful cloud resources, making it easier to process large datasets and perform complex analyses efficiently and cost-effectively. Prompting can be used to facilitate the migration of pipelines to the cloud. As Amazon Q Developer specializes in queries and tasks related to AWS, let's prompt the model to facilitate migration to AWS infrastructure. " ] }, { @@ -1076,309 +437,7 @@ "source": [ "##### **Response**\n", "\n", - "I'll modify the Snakemake workflow for AWS execution and include security best practices. 
Here's the implementation: [[1]](https://repost.aws/articles/ARVnzwqlzzSU279RlcKvJU5w/securing-your-cloud-infrastructure)\n", - "\n", - "```python\n", - "# Snakefile\n", - "\n", - "# Import AWS utilities\n", - "from snakemake.remote.S3 import RemoteProvider\n", - "import boto3\n", - "\n", - "# Configure AWS\n", - "S3 = RemoteProvider(\n", - " access_key_id=os.environ.get(\"AWS_ACCESS_KEY_ID\"),\n", - " secret_access_key=os.environ.get(\"AWS_SECRET_ACCESS_KEY\")\n", - ")\n", - "\n", - "# Import configuration\n", - "configfile: \"config.yaml\"\n", - "\n", - "# Wildcard constraints\n", - "wildcard_constraints:\n", - " sample=\"[A-Za-z0-9_-]+\"\n", - "\n", - "# Target rule\n", - "rule all:\n", - " input:\n", - " expand(\n", - " \"s3://{bucket}/{output_prefix}/{sample}_Aligned.sortedByCoord.out.bam\",\n", - " bucket=config[\"aws\"][\"bucket\"],\n", - " output_prefix=config[\"aws\"][\"output_prefix\"],\n", - " sample=config[\"samples\"]\n", - " )\n", - "\n", - "rule star_align:\n", - " input:\n", - " r1 = S3.remote(\"s3://{bucket}/{input_prefix}/{sample}_R1.fastq\",\n", - " keep_local=True),\n", - " r2 = S3.remote(\"s3://{bucket}/{input_prefix}/{sample}_R2.fastq\",\n", - " keep_local=True)\n", - " output:\n", - " bam = S3.remote(\n", - " \"s3://{bucket}/{output_prefix}/{sample}_Aligned.sortedByCoord.out.bam\"\n", - " )\n", - " params:\n", - " genome_dir = config[\"aws\"][\"genome_dir\"],\n", - " out_prefix = \"temp/{sample}_\",\n", - " region = config[\"aws\"][\"region\"]\n", - " threads: 8\n", - " resources:\n", - " mem_mb = 32000,\n", - " time = \"02:00:00\"\n", - " log:\n", - " S3.remote(\"s3://{bucket}/{output_prefix}/logs/{sample}.log\")\n", - " container:\n", - " \"docker://quay.io/biocontainers/star:2.7.10a--h9ee0642_0\"\n", - " shell:\n", - " \"\"\"\n", - " # Create temporary directory\n", - " mkdir -p temp\n", - "\n", - " # Run STAR\n", - " STAR --genomeDir {params.genome_dir} \\\n", - " --readFilesIn {input.r1} {input.r2} \\\n", - " --outFileNamePrefix {params.out_prefix} \\\n", - " --runThreadN {threads} \\\n", - " --outSAMtype BAM SortedByCoordinate \\\n", - " 2> {log}\n", - "\n", - " # Cleanup\n", - " rm -rf temp\n", - " \"\"\"\n", - "```\n", - "\n", - "Configuration file (config.yaml):\n", - "```yaml\n", - "# config.yaml\n", - "\n", - "aws:\n", - " bucket: \"XXXXXXXXXXXXXXXXXX\"\n", - " region: \"us-west-2\"\n", - " input_prefix: \"raw-data\"\n", - " output_prefix: \"aligned-data\"\n", - " genome_dir: \"s3://my-genomics-bucket/reference/genome\"\n", - " instance_type: \"m5.4xlarge\"\n", - " subnet: \"subnet-xxxxx\"\n", - " security_group: \"sg-xxxxx\"\n", - "\n", - "samples:\n", - " - \"sample1\"\n", - " - \"sample2\"\n", - " - \"sample3\"\n", - "```\n", - "\n", - "AWS Batch configuration (batch.yaml):\n", - "```yaml\n", - "# batch.yaml\n", - "\n", - "__default__:\n", - " job_queue: \"genomics-queue\"\n", - " job_definition: \"snakemake-job\"\n", - " region: \"us-west-2\"\n", - "\n", - "star_align:\n", - " instance_type: \"m5.4xlarge\"\n", - " vcpus: 8\n", - " memory: 32000\n", - " timeout: 7200\n", - "```\n", - "\n", - "IAM Policy for Snakemake:\n", - "```json\n", - "{\n", - " \"Version\": \"2012-10-17\",\n", - " \"Statement\": [\n", - " {\n", - " \"Effect\": \"Allow\",\n", - " \"Action\": [\n", - " \"s3:GetObject\",\n", - " \"s3:PutObject\",\n", - " \"s3:ListBucket\"\n", - " ],\n", - " \"Resource\": [\n", - " \"arn:aws:s3:::my-genomics-bucket/*\",\n", - " \"arn:aws:s3:::my-genomics-bucket\"\n", - " ]\n", - " },\n", - " {\n", - " \"Effect\": \"Allow\",\n", - " \"Action\": [\n", - 
" \"batch:SubmitJob\",\n", - " \"batch:DescribeJobs\",\n", - " \"batch:ListJobs\"\n", - " ],\n", - " \"Resource\": \"*\"\n", - " }\n", - " ]\n", - "}\n", - "```\n", - "\n", - "AWS Batch submission script:\n", - "```python\n", - "#!/usr/bin/env python3\n", - "# submit_aws.py\n", - "\n", - "import boto3\n", - "import yaml\n", - "import os\n", - "\n", - "def submit_workflow():\n", - " session = boto3.Session(\n", - " region_name=config[\"aws\"][\"region\"]\n", - " )\n", - " \n", - " batch = session.client('batch')\n", - " \n", - " response = batch.submit_job(\n", - " jobName='snakemake-workflow',\n", - " jobQueue=config[\"aws\"][\"job_queue\"],\n", - " jobDefinition=config[\"aws\"][\"job_definition\"],\n", - " containerOverrides={\n", - " 'command': [\n", - " 'snakemake',\n", - " '--use-conda',\n", - " '--conda-frontend', 'mamba',\n", - " '--jobs', '20',\n", - " '--default-remote-provider', 'S3',\n", - " '--default-remote-prefix', config[\"aws\"][\"bucket\"]\n", - " ]\n", - " }\n", - " )\n", - " \n", - " print(f\"Submitted job: {response['jobId']}\")\n", - "\n", - "if __name__ == \"__main__\":\n", - " with open(\"config.yaml\") as f:\n", - " config = yaml.safe_load(f)\n", - " submit_workflow()\n", - "```\n", - "\n", - "Security Best Practices:\n", - "\n", - "1. **IAM Configuration**:\n", - "```python\n", - "# Use AWS SDK with minimal permissions\n", - "def get_aws_session():\n", - " return boto3.Session(\n", - " region_name=os.environ.get(\"AWS_REGION\"),\n", - " profile_name=os.environ.get(\"AWS_PROFILE\")\n", - " )\n", - "```\n", - "\n", - "2. **Secrets Management**:\n", - "```python\n", - "# Use AWS Secrets Manager for sensitive data\n", - "def get_secret():\n", - " session = get_aws_session()\n", - " client = session.client('secretsmanager')\n", - " response = client.get_secret_value(\n", - " SecretId='genomics/credentials'\n", - " )\n", - " return json.loads(response['SecretString'])\n", - "```\n", - "\n", - "3. **Network Security**:\n", - "```yaml\n", - "# VPC configuration in config.yaml\n", - "vpc:\n", - " subnet_id: \"subnet-xxxxx\"\n", - " security_group_ids: \n", - " - \"sg-xxxxx\"\n", - " enable_vpc_endpoints: true\n", - "```\n", - "\n", - "4. **Data Encryption**:\n", - "```python\n", - "# Enable S3 encryption\n", - "S3 = RemoteProvider(\n", - " access_key_id=os.environ.get(\"AWS_ACCESS_KEY_ID\"),\n", - " secret_access_key=os.environ.get(\"AWS_SECRET_ACCESS_KEY\"),\n", - " config_kwargs={'s3': {'ServerSideEncryption': 'AES256'}}\n", - ")\n", - "```\n", - "\n", - "5. **Logging and Monitoring**:\n", - "```python\n", - "# Add CloudWatch logging\n", - "rule star_align:\n", - " # ... other directives ...\n", - " log:\n", - " cloudwatch = \"aws_logs\"\n", - " shell:\n", - " \"\"\"\n", - " aws logs create-log-stream \\\n", - " --log-group-name genomics-workflow \\\n", - " --log-stream-name {wildcards.sample}\n", - "\n", - " STAR ... 2>&1 | tee >(aws logs put-log-events \\\n", - " --log-group-name genomics-workflow \\\n", - " --log-stream-name {wildcards.sample})\n", - " \"\"\"\n", - "```\n", - "\n", - "Additional Security Measures:\n", - "\n", - "1. **VPC Endpoint Configuration**:\n", - "```bash\n", - "# Create VPC endpoints for AWS services\n", - "aws ec2 create-vpc-endpoint \\\n", - " --vpc-id vpc-xxxxx \\\n", - " --service-name com.amazonaws.region.s3 \\\n", - " --route-table-ids rtb-xxxxx\n", - "```\n", - "\n", - "2. 
**Data Lifecycle Policy**:\n", - "```json\n", - "{\n", - " \"Rules\": [\n", - " {\n", - " \"ID\": \"Delete old data\",\n", - " \"Status\": \"Enabled\",\n", - " \"ExpirationInDays\": 30\n", - " }\n", - " ]\n", - "}\n", - "```\n", - "\n", - "3. **AWS KMS Encryption**:\n", - "```python\n", - "# Use KMS for encryption\n", - "S3 = RemoteProvider(\n", - " config_kwargs={\n", - " 's3': {\n", - " 'ServerSideEncryption': 'aws:kms',\n", - " 'SSEKMSKeyId': 'arn:aws:kms:region:account:key/key-id'\n", - " }\n", - " }\n", - ")\n", - "```\n", - "\n", - "4. **Access Logging**:\n", - "```yaml\n", - "# Enable S3 access logging\n", - "s3_logging:\n", - " target_bucket: \"XXXXXXXXXX\"\n", - " target_prefix: \"genomics-logs/\"\n", - "```\n", - "\n", - "These security measures ensure:\n", - "- Secure data transmission and storage\n", - "- Proper access control\n", - "- Audit trails\n", - "- Network isolation\n", - "- Encryption at rest and in transit\n", - "- Compliance with security best practices\n", - "\n", - "Remember to:\n", - "- Regularly rotate credentials\n", - "- Monitor access logs\n", - "- Update security groups\n", - "- Patch software dependencies\n", - "- Implement least privilege access\n", - "- Use private subnets where possible\n", - "- Enable AWS CloudTrail\n", - "- Implement resource tagging" + "![alt text](../../Q-snakemake-cloud.png)" ] }, { @@ -1407,7 +466,7 @@ "source": [ "#### Use Case 6: Write code from scratch. \n", "\n", - "Prompting an AI tool like Amazon Q to write a script can be incredibly useful for several reasons. It significantly enhances efficiency by reducing the time and effort needed to create code from scratch. Additionally, it serves as a valuable learning aid, helping users understand coding practices and library usage through generated examples. The generated scripts provide a flexible starting point that can be easily customized to meet specific requirements, allowing users to quickly adapt and expand their projects.\n", + "Prompting an AI tool like Amazon Q Developer to write a script can be incredibly useful for several reasons. It significantly enhances efficiency by reducing the time and effort needed to create code from scratch. Additionally, it serves as a valuable learning aid, helping users understand coding practices and library usage through generated examples. The generated scripts provide a flexible starting point that can be easily customized to meet specific requirements, allowing users to quickly adapt and expand their projects.\n", "\n", "##### **Prompt** \n", "\n", @@ -1420,196 +479,7 @@ "source": [ "##### **Response**\n", "\n", - "I'll help you create an R script for visualizing gene expression data using ggplot2. 
Here's a comprehensive solution: [[1]](https://stackoverflow.com/questions/70765003)\n", - "\n", - "```R\n", - "# Load required libraries\n", - "library(ggplot2)\n", - "library(tidyr)\n", - "library(dplyr)\n", - "library(RColorBrewer)\n", - "\n", - "# Read the gene expression data\n", - "# Assuming your CSV file has genes in columns and samples in rows\n", - "gene_data <- read.csv(\"expression_data.csv\")\n", - "\n", - "# Convert data from wide to long format for plotting\n", - "gene_data_long <- gene_data %>%\n", - " gather(key = \"Gene\", value = \"Expression\", -Sample_ID)\n", - "\n", - "# Calculate mean and standard error for each gene\n", - "gene_stats <- gene_data_long %>%\n", - " group_by(Gene) %>%\n", - " summarise(\n", - " mean_expr = mean(Expression),\n", - " se = sd(Expression) / sqrt(n()),\n", - " .groups = 'drop'\n", - " )\n", - "\n", - "# Create the basic bar plot\n", - "ggplot(gene_stats, aes(x = Gene, y = mean_expr)) +\n", - " # Add bars\n", - " geom_bar(stat = \"identity\", fill = \"steelblue\", alpha = 0.7, width = 0.7) +\n", - " # Add error bars\n", - " geom_errorbar(aes(ymin = mean_expr - se, ymax = mean_expr + se),\n", - " width = 0.2) +\n", - " # Customize theme\n", - " theme_minimal() +\n", - " theme(\n", - " axis.text.x = element_text(angle = 45, hjust = 1, size = 10),\n", - " axis.title = element_text(size = 12, face = \"bold\"),\n", - " plot.title = element_text(size = 14, face = \"bold\", hjust = 0.5),\n", - " panel.grid.major.x = element_blank()\n", - " ) +\n", - " # Add labels\n", - " labs(\n", - " title = \"Gene Expression Levels\",\n", - " x = \"Genes\",\n", - " y = \"Expression Level\"\n", - " )\n", - "\n", - "# Save the plot\n", - "ggsave(\"gene_expression_plot.pdf\", width = 10, height = 6)\n", - "\n", - "# Alternative visualization: Create a heatmap\n", - "# Reorder genes by expression level\n", - "gene_data_heatmap <- gene_data_long %>%\n", - " group_by(Gene) %>%\n", - " mutate(mean_expr = mean(Expression)) %>%\n", - " ungroup() %>%\n", - " arrange(desc(mean_expr))\n", - "\n", - "ggplot(gene_data_heatmap, aes(x = Gene, y = Sample_ID, fill = Expression)) +\n", - " geom_tile() +\n", - " scale_fill_gradientn(colors = brewer.pal(9, \"YlOrRd\")) +\n", - " theme_minimal() +\n", - " theme(\n", - " axis.text.x = element_text(angle = 45, hjust = 1),\n", - " axis.title = element_text(face = \"bold\"),\n", - " plot.title = element_text(hjust = 0.5)\n", - " ) +\n", - " labs(\n", - " title = \"Gene Expression Heatmap\",\n", - " x = \"Genes\",\n", - " y = \"Samples\",\n", - " fill = \"Expression\\nLevel\"\n", - " )\n", - "\n", - "# Save the heatmap\n", - "ggsave(\"gene_expression_heatmap.pdf\", width = 12, height = 8)\n", - "\n", - "# Add statistical analysis\n", - "# Perform ANOVA to test for differences between genes\n", - "anova_result <- aov(Expression ~ Gene, data = gene_data_long)\n", - "anova_summary <- summary(anova_result)\n", - "\n", - "# Add significance levels to the plot\n", - "gene_stats <- gene_stats %>%\n", - " mutate(\n", - " significance = case_when(\n", - " mean_expr > quantile(mean_expr, 0.75) ~ \"***\",\n", - " mean_expr > quantile(mean_expr, 0.5) ~ \"**\",\n", - " mean_expr > quantile(mean_expr, 0.25) ~ \"*\",\n", - " TRUE ~ \"ns\"\n", - " )\n", - " )\n", - "\n", - "# Create enhanced bar plot with significance levels\n", - "ggplot(gene_stats, aes(x = reorder(Gene, -mean_expr), y = mean_expr)) +\n", - " geom_bar(stat = \"identity\", fill = \"steelblue\", alpha = 0.7, width = 0.7) +\n", - " geom_errorbar(aes(ymin = mean_expr - se, ymax = mean_expr 
+ se),\n", - " width = 0.2) +\n", - " geom_text(aes(label = significance, y = mean_expr + se),\n", - " vjust = -0.5) +\n", - " theme_minimal() +\n", - " theme(\n", - " axis.text.x = element_text(angle = 45, hjust = 1, size = 10),\n", - " axis.title = element_text(size = 12, face = \"bold\"),\n", - " plot.title = element_text(size = 14, face = \"bold\", hjust = 0.5),\n", - " panel.grid.major.x = element_blank()\n", - " ) +\n", - " labs(\n", - " title = \"Gene Expression Levels with Significance\",\n", - " x = \"Genes\",\n", - " y = \"Expression Level\",\n", - " caption = \"Significance levels: *** p<0.001, ** p<0.01, * p<0.05, ns: not significant\"\n", - " )\n", - "\n", - "# Save the enhanced plot\n", - "ggsave(\"gene_expression_plot_with_significance.pdf\", width = 12, height = 8)\n", - "\n", - "# Add interactive features using plotly\n", - "library(plotly)\n", - "\n", - "interactive_plot <- ggplotly(\n", - " ggplot(gene_stats, aes(x = reorder(Gene, -mean_expr), y = mean_expr,\n", - " text = paste(\"Gene:\", Gene,\n", - " \"\\nExpression:\", round(mean_expr, 2),\n", - " \"\\nSE:\", round(se, 2)))) +\n", - " geom_bar(stat = \"identity\", fill = \"steelblue\", alpha = 0.7) +\n", - " geom_errorbar(aes(ymin = mean_expr - se, ymax = mean_expr + se),\n", - " width = 0.2) +\n", - " theme_minimal() +\n", - " labs(title = \"Interactive Gene Expression Plot\",\n", - " x = \"Genes\",\n", - " y = \"Expression Level\")\n", - ")\n", - "\n", - "# Save the interactive plot as HTML\n", - "htmlwidgets::saveWidget(interactive_plot, \"interactive_gene_expression.html\")\n", - "```\n", - "\n", - "This script provides:\n", - "\n", - "1. **Basic Visualization**:\n", - " - Bar plot with error bars\n", - " - Clear labels and titles\n", - " - Customized theme\n", - "\n", - "2. **Alternative Visualizations**:\n", - " - Heatmap for multiple samples\n", - " - Interactive plot using plotly\n", - "\n", - "3. **Statistical Analysis**:\n", - " - Calculation of means and standard errors\n", - " - ANOVA testing\n", - " - Significance level indicators\n", - "\n", - "4. **Customization Options**:\n", - " - Color schemes\n", - " - Text formatting\n", - " - Plot dimensions\n", - "\n", - "To use this script:\n", - "\n", - "1. Prepare your data in CSV format with columns:\n", - " - Sample_ID\n", - " - Gene names as column headers\n", - " - Expression values in cells\n", - "\n", - "2. Adjust file paths:\n", - "```R\n", - "gene_data <- read.csv(\"your_file_path.csv\")\n", - "```\n", - "\n", - "3. Customize visualization parameters:\n", - "```R\n", - "# Change colors\n", - "ggplot(...) +\n", - " geom_bar(fill = \"your_color\")\n", - "\n", - "# Adjust plot dimensions\n", - "ggsave(\"output.pdf\", width = your_width, height = your_height)\n", - "```\n", - "\n", - "4. Add additional analyses as needed:\n", - "```R\n", - "# Add correlation analysis\n", - "cor_matrix <- cor(gene_data[,-1])\n", - "corrplot(cor_matrix, method = \"color\")\n", - "```\n", - "\n", - "The script includes error handling and produces publication-quality figures with statistical annotations." + "![alt text](../../Q-R-script.png)" ] }, { @@ -1632,7 +502,7 @@ "source": [ "#### Use Case 7: Error debugging\n", "\n", - "Amazon Q can also be used to identify and fix errors in your code. This is highly beneficial as it can save time and identify errors that may have been difficult resolve. " + "Amazon Q Developer can also be used to identify and fix errors in your code. 
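This is highly beneficial as it can save time and identify errors that may have been difficult to resolve.\n",
+ "\n",
+ "As a minimal, hypothetical illustration of the kind of slip it catches (the DataFrame here is invented for the example):\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "df = pd.DataFrame({'Expression_Level': [2.3, 4.1, 3.8]})\n",
+ "summary = df.describe   # Bug: missing parentheses binds the method, not its result\n",
+ "summary = df.describe() # Fix: call the method to compute summary statistics\n",
+ "print(summary)\n",
+ "```"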
 ]
 },
 {
@@ -1664,7 +534,7 @@
 "source": [
 "### Conclusions\n",
 "\n",
- "Congrats! You have successfully experimented with the features of Amazon Q. We hope you continue leveraging the powers of GenAI and Amazon Q to drive impactful results in your projects.\n",
+ "Congrats! You have successfully experimented with the features of Amazon Q Developer. We hope you continue leveraging the powers of GenAI and Amazon Q Developer to drive impactful results in your projects.\n",
 "\n",
 "### Clean Up\n",
 "\n",