Sample data shared across several Azure Search sample code projects.

jorgelunams/azure-search-sample-data

 
 

Azure Cognitive Search Sample Data

This repository contains data files used in Azure Cognitive Search quickstarts, tutorials, and examples. Each folder represents a different sample data set. Most sample data is used for indexer and AI enrichment scenarios and is typically uploaded to Azure Storage so that it can be accessed by an indexer.

This repository also contains an ARCHIVE folder for previously published data files that are no longer used in samples or docs.

The repository previously included a STOPWORDS.MD file. That content now lives in the Reference section of the Azure Cognitive Search documentation, under Stopwords reference (Microsoft analyzers).

AI-enrichment-mixed-media

This folder contains 14 files of mixed content types, including HTML, JPG, PDF, PowerPoint, Word, PNG, and TXT files. These files are used to demonstrate the breadth of skillset processing of multiple content types using a combination of built-in skills. This sample data is intended for upload to an Azure Blob storage container, and then referenced from an indexer's data source object.

Used in: Quickstart: Create a skillset
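As a rough illustration of the "data source object" mentioned above, the following Python sketch builds the kind of JSON definition an indexer data source for Blob storage typically uses. The data source name, container name, and connection string placeholder are assumptions made for this example, not values from this repository.

```python
import json


def blob_data_source(name, connection_string, container):
    """Build a data source definition that points an indexer at one blob container."""
    return {
        "name": name,
        "type": "azureblob",  # data source type for Azure Blob storage
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


definition = blob_data_source(
    name="mixed-media-ds",  # hypothetical data source name
    connection_string="<your-storage-connection-string>",  # placeholder, not a real secret
    container="mixed-media",  # hypothetical container holding the uploaded files
)
print(json.dumps(definition, indent=2))
```

The printed JSON is only the request body; sending it to a search service (through REST, an SDK, or the portal) is left out of this sketch.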

Clinical trials

This folder contains three subfolders of clinical trials data from https://clinicaltrials.gov. Subfolders contain large and small numbers of files, plus a JSON version.

  • Clinical trials JSON - Consists of 8 semi-structured JSON files that you can upload to Azure Blob storage, and then import using the Azure Blob indexer.

  • Clinical trials PDF 19 - Consists of 19 PDF files used in AI enrichment lessons. This data set can be used in AI enrichment pipelines on the free tier, using the free allocation of daily transactions per indexer.

  • Clinical trials PDF 107 - Consists of 107 PDF files used in knowledge mining labs and tutorials. Processing this quantity of documents requires an attached Cognitive Services all-in-one resource.

Used in: Index Azure JSON blobs tutorial

Famous-speeches

This folder includes 4 PDF files of famous American speeches by Abraham Lincoln, John F. Kennedy, and Martin Luther King. These files are used to demonstrate entity recognition and custom entity lookup. The custom entity definition file that provides the lookup entities is located with the Postman collection.

Used in: Custom Entity Lookup skill (Postman collection)

Hotels data

The Hotels folder contains fictitious demo data for quickstarts, tutorials, and code examples. This is the default data set for many Azure Cognitive Search samples. It consists of 50 hotels across the United States and includes data to support all query types, including geospatial filters. It is structured and sized to run on the free tier.
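To give a sense of what a geospatial filter over this data might look like, here is a minimal Python sketch that builds an OData filter string using the geo.distance function. The field name Location and the coordinates are assumptions for illustration; check the index definition for the actual geography field.

```python
def within_km(field, longitude, latitude, kilometers):
    """Build an OData filter matching documents within a distance of a point.

    The point is written longitude first, then latitude.
    """
    return (
        f"geo.distance({field}, geography'POINT({longitude} {latitude})') "
        f"le {kilometers}"
    )


# "Location" and the coordinates below are assumed example values.
filter_expression = within_km("Location", -122.12, 47.67, 10)
print(filter_expression)
```

The resulting string would be passed as the filter parameter of a query; the query call itself is not shown.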

Hotels demo data is provided in multiple formats to support different consumption models. The data is identical regardless of how you load it. Data files are in JSON, but there are several versions depending on whether you are uploading the data to Azure Cosmos DB or pushing it to an index in Azure Cognitive Search.

Use the following files to create the hotels sample on your search service:

  • Hotels.postman_collection.json - Using Postman, import this collection to execute requests that create and populate the Hotels index using JSON documents.

  • Hotels_IndexDefinition.JSON - A standalone JSON file containing just the index definition. This definition is equivalent to the hosted index on the azs-playground search service.

  • HotelsData_toAzureSearch.JSON - A standalone JSON file containing documents for 50 hotels and related room information.

  • HotelsData_toCosmosDB.JSON - JSON used to populate an Azure Cosmos DB with the Hotels sample data. This can be used as a data source for an indexer to pull data into the Hotels index.
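When pushing documents to an index, each document in the request body usually carries an action such as "upload". The Python sketch below wraps plain documents in that shape. The hotel field names and values are hypothetical, and the shipped HotelsData_toAzureSearch.JSON file may already be in this form, so treat this as an illustration rather than a required step.

```python
import json


def to_upload_batch(documents):
    """Wrap plain documents in a push-API style batch with an upload action."""
    return {"value": [{"@search.action": "upload", **doc} for doc in documents]}


# Hypothetical documents; the real data set has many more fields per hotel.
hotels = [
    {"HotelId": "1", "HotelName": "Example Hotel A"},
    {"HotelId": "2", "HotelName": "Example Hotel B"},
]

batch = to_upload_batch(hotels)
print(json.dumps(batch, indent=2))
```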

Hotels-json-documents

This sample data set consists of 5 JSON documents containing structured JSON, used for evaluating or testing JSON blob indexing. Each file consists of hotel information, an address complex field, and a rooms complex collection. The blob indexer can detect and match this JSON structure through equivalent fields in a search index.
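The shape described above (simple fields, one complex field, one complex collection) can be sketched in Python as follows. The field names and values are invented for illustration and are not copied from the files in this folder.

```python
def describe_fields(document):
    """Classify each top-level field by the JSON shape of its value."""
    kinds = {}
    for name, value in document.items():
        if isinstance(value, dict):
            kinds[name] = "complex field"
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            kinds[name] = "complex collection"
        else:
            kinds[name] = "simple field"
    return kinds


# Hypothetical hotel document with a nested address and a list of rooms.
sample = {
    "HotelId": "1",
    "HotelName": "Example Hotel",
    "Address": {"City": "Example City", "StateProvince": "EX"},
    "Rooms": [
        {"Type": "Standard", "BaseRate": 100.0},
        {"Type": "Suite", "BaseRate": 250.0},
    ],
}

for field, kind in describe_fields(sample).items():
    print(f"{field}: {kind}")
```

A search index that mirrors this structure would declare a matching complex field for the address and a complex collection for the rooms.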

Hotels-sql

This is a SQL script that creates a database and a table, and then inserts 12 rows of partial hotel information.

HotelReviews

This folder contains two files:

  • A CSV file provides data consisting of customer reviews of various fictional hotels in Europe. You can use this data in AI enrichment tutorials, applying sentiment analysis, language detection, and text translation. When indexing content from a CSV file, be sure to select a parsing mode so that individual documents can be created for each line in the file.

  • A JSON file provides a skillset definition.

Used in: Create a knowledge store
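The note about parsing modes can be pictured with a small local example: with a delimited-text style of parsing, each data line of the CSV becomes its own document, keyed by the header row. The sketch below imitates that behavior with Python's csv module. The column names and review text are made up and do not come from the CSV file in this folder.

```python
import csv
import io

# Hypothetical CSV content: a header row followed by one review per line.
sample_csv = (
    "review_id,hotel,review_text\n"
    '1,Example Hotel A,"Clean rooms, friendly staff."\n'
    "2,Example Hotel B,Too noisy at night.\n"
)


def rows_to_documents(text):
    """Turn each CSV data line into one dictionary keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


documents = rows_to_documents(sample_csv)
for doc in documents:
    print(doc)
```

Without a per-line parsing mode, the whole file would be treated as a single blob of text, so sentiment or language results could not be attributed to an individual review.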

NASA e-books

Content from NASA's earth book (February 2019) is used in conceptual examples that explain semantic search and answers. This folder contains a collection of PDFs from NASA's downloadable books site. The folder includes the entire book, intact, as a single PDF file. A subfolder contains per-page extractions as separate PDF files for both images and text, as well as text-only pages.

The first 10 PDFs in \azure-search-sample-data\nasa-e-book\text-only are used in entity recognition and entity linking skills processing demos.

Used in: Demo skills (Postman collections for Entity Recognition and Entity Linking)

Unsplash images

Images from https://unsplash.com/s/photos/landmark and https://unsplash.com/s/photos/ are used in OCR and image analysis skills processing demos. There are ten images in each folder.

  • The "jpg-landmarks" folder contains photos of well-known buildings and structures. It's used to demonstrate image analysis.

  • The "jpg-signs" folder contains photos that include signs and is used to demonstrate OCR skillset processing.

Used in: Demo skills (Postman collections for OCR and Image Analysis)

Spanish museums

This folder includes 10 Word document files in Spanish and French, five in each language. Content consists of museum descriptions from the "Essential Museums" brochure on the Official tourism portal of Spain. These files are used in Language Detection and Text Translation skills processing demos. Content from the brochure was copied into individual Word document files, one for each museum and language combination.

Used in: Demo skills (Postman collections for Text Translation and Language Detection)

ARCHIVE

hotels-2019

The original version of the built-in sample containing fictitious hotel information.

Caselaw

An example that used data from the Caselaw Access Project has been updated to use different data and steps, so the data file used for that exercise is now archived. The Caselaw Access Project provides public bulk downloads of case data by jurisdiction. Several jurisdictions are freely available without having to request access first. We chose the first one (Arkansas jurisdiction) and took the first 10 cases. The file name for this data set is caselaw-sample.json. If you upload this file to Azure Blob storage and use the Import data wizard to index the documents, choose the JSON Lines parsing mode.
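JSON Lines means that every line of the file is a complete JSON object, and each line becomes a separate document. A minimal Python sketch of that interpretation follows. The two records are invented and are not taken from caselaw-sample.json.

```python
import json

# Hypothetical JSON Lines content: one JSON object per line.
sample_jsonl = (
    '{"id": 1, "name": "Example v. Sample"}\n'
    '{"id": 2, "name": "Doe v. Roe"}\n'
)


def parse_json_lines(text):
    """Parse each non-empty line as an independent JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


cases = parse_json_lines(sample_jsonl)
print(len(cases), "documents")
```

Parsing the same text as one ordinary JSON document would fail, which is why the parsing mode has to match the file layout.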
