Transcription data prep

The transcription data prep scripts download YouTube video transcripts and prepare them for use with the Semantic Search with OpenAI Embeddings and Functions sample.

The transcription data prep scripts have been tested on the latest releases of Windows 11, macOS Ventura, and Ubuntu 22.04 (and above).

Create required Azure OpenAI Service resources

Important

We suggest you update the Azure CLI to the latest version to ensure compatibility with OpenAI. See the documentation.

Create a resource group

Note

These instructions use a resource group named "semantic-video-search" in East US. You can change the name of the resource group, but if you change the location of the resources, check the model availability table to confirm the models are offered in that region.

az group create --name semantic-video-search --location eastus

Create an Azure OpenAI Service resource.

az cognitiveservices account create --name semantic-video-openai --resource-group semantic-video-search \
    --location eastus --kind OpenAI --sku s0

Get the endpoint and keys for use in this application.

az cognitiveservices account show --name semantic-video-openai \
    --resource-group semantic-video-search | jq -r .properties.endpoint
az cognitiveservices account keys list --name semantic-video-openai \
    --resource-group semantic-video-search | jq -r .key1
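If jq is not installed, the same two fields can be read with Python's standard json module. The following is a minimal sketch, assuming the output of each az command has been captured as a JSON string. The helper names and the sample payloads are hypothetical; only the `properties.endpoint` and `key1` paths come from the jq filters above.

```python
import json


def extract_endpoint(account_show_json: str) -> str:
    """Return the value that `jq -r .properties.endpoint` would print."""
    return json.loads(account_show_json)["properties"]["endpoint"]


def extract_key1(keys_list_json: str) -> str:
    """Return the value that `jq -r .key1` would print."""
    return json.loads(keys_list_json)["key1"]


# Hypothetical sample payloads for illustration only; real az output
# contains many more fields and real values.
sample_account = '{"name": "semantic-video-openai", "properties": {"endpoint": "https://example.invalid/"}}'
sample_keys = '{"key1": "not-a-real-key", "key2": "not-a-real-key-2"}'

print(extract_endpoint(sample_account))
print(extract_key1(sample_keys))
```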

Deploy the following models:
- text-embedding-ada-002 version 2 or greater, named text-embedding-ada-002
- gpt-35-turbo version 0613 or greater, named gpt-35-turbo

az cognitiveservices account deployment create \
    --name semantic-video-openai \
    --resource-group semantic-video-search \
    --deployment-name text-embedding-ada-002 \
    --model-name text-embedding-ada-002 \
    --model-version "2" \
    --model-format OpenAI \
    --scale-settings-scale-type "Standard"
az cognitiveservices account deployment create \
    --name semantic-video-openai \
    --resource-group semantic-video-search \
    --deployment-name gpt-35-turbo \
    --model-name gpt-35-turbo \
    --model-version "0613" \
    --model-format OpenAI \
    --sku-capacity 100 \
    --sku-name "Standard"
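If you script the deployments, it can help to build the argument list in one place so the deployment name and model name stay in sync. The following is a rough sketch, assuming the same flags as the gpt-35-turbo command above. The `deployment_command` helper is hypothetical and only assembles and prints the command; it does not run it.

```python
import shlex


def deployment_command(account, resource_group, model_name, model_version, capacity=100):
    """Build the az CLI argument list for one model deployment.

    The deployment is named after the model, as in the commands above.
    """
    return [
        "az", "cognitiveservices", "account", "deployment", "create",
        "--name", account,
        "--resource-group", resource_group,
        "--deployment-name", model_name,
        "--model-name", model_name,
        "--model-version", model_version,
        "--model-format", "OpenAI",
        "--sku-capacity", str(capacity),
        "--sku-name", "Standard",
    ]


cmd = deployment_command(
    "semantic-video-openai", "semantic-video-search", "gpt-35-turbo", "0613"
)
# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` once you are happy with the printed command.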

Required software

Python 3.9 or greater
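To confirm that the interpreter on your path meets this requirement, a short check like the following can be used. This is a minimal sketch; the `meets_minimum` helper is hypothetical and not part of the sample.

```python
import sys

MINIMUM_PYTHON = (3, 9)


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM_PYTHON) -> bool:
    """Return True when the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


current = ".".join(str(part) for part in sys.version_info[:3])
print(f"Python {current}: meets minimum = {meets_minimum()}")
```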

Environment variables

The following environment variables are required to run the YouTube transcription data prep scripts.

On Windows

We recommend adding the variables to your user environment variables: Windows Start > Edit the system environment variables > Environment Variables > User variables for [USER] > New.

AZURE_OPENAI_API_KEY = <your Azure OpenAI Service API key>
AZURE_OPENAI_ENDPOINT = <your Azure OpenAI Service endpoint>
AZURE_OPENAI_MODEL_DEPLOYMENT_NAME = <your Azure OpenAI Service model deployment name>
GOOGLE_DEVELOPER_API_KEY = <your Google developer API key>

On Linux and macOS

We recommend adding the following exports to your ~/.bashrc or ~/.zshrc file.

export AZURE_OPENAI_API_KEY="<your Azure OpenAI Service API key>"
export AZURE_OPENAI_ENDPOINT="<your Azure OpenAI Service endpoint>"
export AZURE_OPENAI_MODEL_DEPLOYMENT_NAME="<your Azure OpenAI Service model deployment name>"
export GOOGLE_DEVELOPER_API_KEY="<your Google developer API key>"
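Before running the scripts, it can be useful to confirm that all four variables are visible to Python, on any platform. The following is a minimal sketch; the `missing_variables` helper is hypothetical, and the variable names are the ones listed above.

```python
import os

REQUIRED_VARIABLES = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_MODEL_DEPLOYMENT_NAME",
    "GOOGLE_DEVELOPER_API_KEY",
)


def missing_variables(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


missing = missing_variables()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

If a variable is reported as missing right after you added it, open a new terminal window so the updated environment is loaded.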

Install the required Python libraries

Install the git client if it's not already installed.

From a Terminal window, clone the sample to your preferred repo folder.

git clone https://github.com/gloveboxes/semanic-search-openai-embeddings-functions.git

Navigate to the data_prep folder.

cd semanic-search-openai-embeddings-functions/src/data_prep

Create a Python virtual environment.

On Windows:
```
python -m venv .venv
```
On macOS and Linux:
```
python3 -m venv .venv
```
Activate the Python virtual environment.

On Windows:
```
.venv\Scripts\activate
```
On macOS and Linux:
```
source .venv/bin/activate
```
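To confirm that the activation worked, you can ask Python whether it is running inside a virtual environment. This is a minimal sketch, assuming the environment was created with the standard venv module as shown above; the `in_virtual_environment` helper is hypothetical.

```python
import sys


def in_virtual_environment(prefix=sys.prefix, base_prefix=sys.base_prefix) -> bool:
    """Return True when the interpreter prefix differs from the base install.

    Inside a venv, sys.prefix points at the environment folder while
    sys.base_prefix still points at the Python installation it was created from.
    """
    return prefix != base_prefix


print(f"Running inside a virtual environment: {in_virtual_environment()}")
```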

Install the required libraries.

On Windows:

pip install -r requirements.txt

On macOS and Linux:

pip3 install -r requirements.txt

Run the YouTube transcription data prep scripts

On Windows

.\transcripts_prepare.ps1

On macOS and Linux

./transcripts_prepare.sh
