Update documentation (#185)
* update documentation

* remove obsolete notebooks

* comment on data usage and internet access
iulusoy authored Feb 19, 2024
1 parent bec845e commit 8752398
Showing 37 changed files with 986 additions and 2,334 deletions.
33 changes: 32 additions & 1 deletion README.md
@@ -119,7 +119,7 @@ Place the data files and google cloud vision API key in your google drive to acc
## Features
### Text extraction
The text is extracted from the images using [google-cloud-vision](https://cloud.google.com/vision). For this, you need an API key. Set up your google account following the instructions on the google Vision AI website or as described [here](docs/google_Cloud_Vision_API/set_up_credentials.md).
The text is extracted from the images using [google-cloud-vision](https://cloud.google.com/vision). For this, you need an API key. Set up your google account following the instructions on the google Vision AI website or as described [here](docs/source/set_up_credentials.md).
You then need to export the location of the API key as an environment variable:
```
export GOOGLE_APPLICATION_CREDENTIALS="location of your .json"
```
@@ -147,3 +147,34 @@ Color detection is carried out using [colorgram.py](https://github.com/obskyr/co
### Cropping of posts
Social media posts can automatically be cropped to remove further comments on the page and restrict the textual content to the first comment only.
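The idea behind this kind of cropping can be illustrated with a minimal sketch. Note that this is a conceptual example using Pillow, not ammico's actual implementation; the `crop_above` helper and the boundary value are made up for illustration, and in practice the boundary would come from ammico's detection of where the first comment ends.

```python
from PIL import Image

def crop_above(image: Image.Image, boundary_y: int) -> Image.Image:
    """Keep only the area above boundary_y, e.g. the first comment of a post."""
    return image.crop((0, 0, image.width, boundary_y))

# demo on a synthetic 100x200 image; a real boundary would come from a detector
post = Image.new("RGB", (100, 200), "white")
first_comment = crop_above(post, 120)
print(first_comment.size)
```

Cropping is a pure image operation, so it runs locally and does not send any data to a cloud service.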
# FAQ
## What happens to the images that are sent to google Cloud Vision?
According to the [google Vision API](https://cloud.google.com/vision/docs/data-usage), the images that are uploaded and analysed are not stored and not shared with third parties:
> We won't make the content that you send available to the public. We won't share the content with any third party. The content is only used by Google as necessary to provide the Vision API service. Vision API complies with the Cloud Data Processing Addendum.
> For online (immediate response) operations (`BatchAnnotateImages` and `BatchAnnotateFiles`), the image data is processed in memory and not persisted to disk.
> For asynchronous offline batch operations (`AsyncBatchAnnotateImages` and `AsyncBatchAnnotateFiles`), we must store that image for a short period of time in order to perform the analysis and return the results to you. The stored image is typically deleted right after the processing is done, with a failsafe Time to live (TTL) of a few hours.
> Google also temporarily logs some metadata about your Vision API requests (such as the time the request was received and the size of the request) to improve our service and combat abuse.
## What happens to the text that is sent to google Translate?
According to [google Translate](https://cloud.google.com/translate/data-usage), the data is not stored after processing and not made available to third parties:
> We will not make the content of the text that you send available to the public. We will not share the content with any third party. The content of the text is only used by Google as necessary to provide the Cloud Translation API service. Cloud Translation API complies with the Cloud Data Processing Addendum.
> When you send text to Cloud Translation API, text is held briefly in-memory in order to perform the translation and return the results to you.
## What happens if I don't have internet access - can I still use ammico?
There is no general answer: some features of ammico require an internet connection, while others can be used offline:
- Text extraction: To extract text from images and translate it, the data needs to be processed by google Cloud Vision and google Translate, which run in the cloud. Without internet access, text extraction and translation are not possible.
- Image summary and query: After an initial download of the models, the `summary` module does not require an internet connection.
- Facial expressions: After an initial download of the models, the `faces` module does not require an internet connection.
- Multimodal search: After an initial download of the models, the `multimodal_search` module does not require an internet connection.
- Color analysis: The `color` module does not require an internet connection.
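The split above can be turned into a quick pre-flight check before running the detectors. The `has_internet` helper below is a minimal sketch of our own, not part of ammico, and the module names are taken from the list above:

```python
import socket

# modules that keep working without a connection once the models are downloaded
OFFLINE_SAFE = ["summary", "faces", "multimodal_search", "color"]

def has_internet(host: str = "8.8.8.8", port: int = 53, timeout: float = 3.0) -> bool:
    """Best-effort connectivity check: try a TCP connection to a public DNS server."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not has_internet():
    print("No internet access: text extraction and translation are unavailable;")
    print("restrict the analysis to:", OFFLINE_SAFE)
```

A check like this lets a longer analysis pipeline fail fast, instead of erroring out when the `TextDetector` first tries to reach google Cloud Vision.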
83 changes: 42 additions & 41 deletions ammico/notebooks/DemoNotebook_ammico.ipynb
@@ -39,21 +39,50 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You can download a dataset for test purposes."
"## Use a test dataset\n",
"You can download a dataset for test purposes. Skip this step if you use your own data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"from datasets import load_dataset\n",
"from pathlib import Path\n",
"\n",
"# If the dataset is gated/private, make sure you have run huggingface-cli login\n",
"dataset = load_dataset(\"iulusoy/test-images\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next you need to provide a path for the saved images - a folder where the data is stored locally. This directory is automatically created if it does not exist."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_path = \"./data-test\"\n",
"data_path = Path(data_path)\n",
"data_path.mkdir(parents=True, exist_ok=True)\n",
"# now save the files from the Huggingface dataset as images into the data_path folder\n",
"for i, image in enumerate(dataset[\"train\"][\"image\"]):\n",
" filename = \"img\" + str(i) + \".png\"\n",
" image.save(data_path / filename)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the ammico package."
"## Import the ammico package."
]
},
{
@@ -86,15 +115,13 @@
"source": [
"# Step 0: Create and set a Google Cloud Vision Key\n",
"\n",
"Please note that for the [Google Cloud Vision API](https://cloud.google.com/vision/docs/setup) (the TextDetector class) you need to set a key in order to process the images. This key is ideally set as an environment variable using for example\n",
"Please note that for the [Google Cloud Vision API](https://cloud.google.com/vision/docs/setup) (the TextDetector class) you need to set a key in order to process the images. A key is generated following [these instructions](https://ssciwr.github.io/AMMICO/build/html/create_API_key_link.html). This key is ideally set as an environment variable using for example\n",
"```\n",
"os.environ[\n",
" \"GOOGLE_APPLICATION_CREDENTIALS\"\n",
"] = \"/content/drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\"\n",
"```\n",
"where you place the key on your Google Drive if running on colab, or place it in a local folder on your machine.\n",
"\n",
"To set up the key, see [here]()."
"where you place the key on your Google Drive if running on colab, or place it in a local folder on your machine."
]
},
{
@@ -103,8 +130,7 @@
"metadata": {},
"outputs": [],
"source": [
"# os.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"] = \"/content/drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\"\n",
"os.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"] = \"../../data/misinformation-campaign-981aa55a3b13.json\""
"# os.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"] = \"/content/drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\""
]
},
{
@@ -122,7 +148,9 @@
"| `limit` | `int` | maximum number of files to read (defaults to `20`, for all images set to `None` or `-1`) |\n",
"| `random_seed` | `str` | the random seed for shuffling the images; applies when only a few images are read and the selection should be preserved (defaults to `None`) |\n",
"\n",
"The `find_files` function returns a nested dict that contains the file ids and the paths to the files and is empty otherwise. This dict is filled step by step with more data as each detector class is run on the data (see below)."
"The `find_files` function returns a nested dict that contains the file ids and the paths to the files and is empty otherwise. This dict is filled step by step with more data as each detector class is run on the data (see below).\n",
"\n",
"If you downloaded the test dataset above, you can directly provide the path you already set for the test directory, `data_path`."
]
},
{
@@ -133,20 +161,11 @@
"source": [
"image_dict = ammico.find_files(\n",
" # path=\"/content/drive/MyDrive/misinformation-data/\",\n",
" path=\"../../data/\",\n",
" path=data_path.as_posix(),\n",
" limit=15,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"image_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -235,15 +254,6 @@
" image_df.to_csv(dump_file)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(image_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -296,15 +306,6 @@
" image_df.to_csv(dump_file)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"image_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -383,7 +384,7 @@
"metadata": {},
"outputs": [],
"source": [
"os.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"] = \"/content/drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\"\n"
"# os.environ[\"GOOGLE_APPLICATION_CREDENTIALS\"] = \"/content/drive/MyDrive/misinformation-data/misinformation-campaign-981aa55a3b13.json\"\n"
]
},
{
@@ -927,7 +928,7 @@
"metadata": {},
"outputs": [],
"source": [
"import importlib_resources # only requare for image query example\n",
"import importlib_resources # only require for image query example\n",
"image_example_query = str(importlib_resources.files(\"ammico\") / \"data\" / \"test-crop-image.png\") # creating the path to the image for the image query example\n",
"\n",
"search_query = [\n",
@@ -1262,7 +1263,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
"version": "3.11.5"
}
},
"nbformat": 4,