[Docs] Add Ingest tools overview and Upload data files (#526)

Closes: [#327](elastic/docs-projects#372) and
[#444](elastic/docs-projects#444)
Previews: [Ingest tools overview](https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/526/manage-data/ingest/tools) and [Upload data files](https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/526/manage-data/ingest/upload-data-files)

### Summary

- As discussed, combined content from linked resources to add a table
for the tools overview. A lot of pages have moved around so I'd really
appreciate a quick double-check of the links to make sure they're
pointing to the right location 🙏
- Added content to the Upload data files page and changed its location
(see questions below)

### Questions

1. Since we decided that a table would work best for the overview, I'm
not sure how to incorporate
https://www.elastic.co/guide/en/cloud/current/ec-cloud-ingest-data.html
on this page. I don't think all of its content fits here based on the
new IA. Maybe it could go in the reference section? I'm open to any
ideas!!
2. The Upload data files page was originally nested under the Ingest
tools overview, but I think it's better suited as a sibling in the
overall Manage data/Ingest section. What do you think?
wajihaparvez authored Feb 20, 2025
1 parent 1c2e612 commit 7254fdc
Showing 15 changed files with 105 additions and 40 deletions.
2 changes: 1 addition & 1 deletion explore-analyze/machine-learning/nlp/ml-nlp-ner-example.md
@@ -113,7 +113,7 @@ Using the example text "Elastic is headquartered in Mountain View, California.",

## Add the NER model to an {{infer}} ingest pipeline [ex-ner-ingest]

-You can perform bulk {{infer}} on documents as they are ingested by using an [{{infer}} processor](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-processor.html) in your ingest pipeline. The novel *Les Misérables* by Victor Hugo is used as an example for {{infer}} in the following example. [Download](https://github.com/elastic/stack-docs/blob/8.5/docs/en/stack/ml/nlp/data/les-miserables-nd.json) the novel text split by paragraph as a JSON file, then upload it by using the [Data Visualizer](../../../manage-data/ingest/tools/upload-data-files.md). Give the new index the name `les-miserables` when uploading the file.
+You can perform bulk {{infer}} on documents as they are ingested by using an [{{infer}} processor](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-processor.html) in your ingest pipeline. The novel *Les Misérables* by Victor Hugo is used as an example for {{infer}} in the following example. [Download](https://github.com/elastic/stack-docs/blob/8.5/docs/en/stack/ml/nlp/data/les-miserables-nd.json) the novel text split by paragraph as a JSON file, then upload it by using the [Data Visualizer](../../../manage-data/ingest/upload-data-files.md). Give the new index the name `les-miserables` when uploading the file.

Now create an ingest pipeline either in the [Stack management UI](ml-nlp-inference.md#ml-nlp-inference-processor) or by using the API:
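
A minimal sketch of the API call is shown below. The pipeline name, the `target_field`, the model ID, and the assumption that the uploaded documents keep their text in a `paragraph` field are illustrative only; adjust them to match your deployment:

```console
# Sketch only: pipeline name, model ID, and field names are assumptions
PUT _ingest/pipeline/ner
{
  "description": "NER pipeline using a trained model deployed in the cluster",
  "processors": [
    {
      "inference": {
        "model_id": "elastic__distilbert-base-cased-finetuned-conll03-english",
        "target_field": "ml.ner",
        "field_map": {
          "paragraph": "text_field"
        }
      }
    }
  ]
}
```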

@@ -103,7 +103,7 @@ In this step, you load the data that you later use in an ingest pipeline to get

The data set `msmarco-passagetest2019-top1000` is a subset of the MS MARCO Passage Ranking data set used in the testing stage of the 2019 TREC Deep Learning Track. It contains 200 queries and for each query a list of relevant text passages extracted by a simple information retrieval (IR) system. From that data set, all unique passages with their IDs have been extracted and put into a [tsv file](https://github.com/elastic/stack-docs/blob/8.5/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv), totaling 182469 passages. In the following, this file is used as the example data set.

-Upload the file by using the [Data Visualizer](../../../manage-data/ingest/tools/upload-data-files.md). Name the first column `id` and the second one `text`. The index name is `collection`. After the upload is done, you can see an index named `collection` with 182469 documents.
+Upload the file by using the [Data Visualizer](../../../manage-data/ingest/upload-data-files.md). Name the first column `id` and the second one `text`. The index name is `collection`. After the upload is done, you can see an index named `collection` with 182469 documents.

:::{image} ../../../images/machine-learning-ml-nlp-text-emb-data.png
:alt: Importing the data
:::
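
As a quick sanity check (not part of the original steps), you can verify the resulting document count; the request below assumes the index was named `collection` as described above:

```console
# Should report a count of 182469 once the upload has finished
GET collection/_count
```
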
2 changes: 1 addition & 1 deletion manage-data/ingest.md
@@ -28,7 +28,7 @@ Elastic offer tools designed to ingest specific types of general content. The co
* To send **application data** directly to {{es}}, use an [{{es}} language client](https://www.elastic.co/guide/en/elasticsearch/client/index.html).
* To index **web page content**, use the Elastic [web crawler](https://www.elastic.co/web-crawler).
* To sync **data from third-party sources**, use [connectors](https://www.elastic.co/guide/en/elasticsearch/reference/current/es-connectors.html). A connector syncs content from an original data source to an {{es}} index. Using connectors you can create *searchable*, read-only replicas of your data sources.
-* To index **single files** for testing in a non-production environment, use the {{kib}} [file uploader](ingest/tools/upload-data-files.md).
+* To index **single files** for testing in a non-production environment, use the {{kib}} [file uploader](ingest/upload-data-files.md).

If you would like to try things out before you add your own data, try using our [sample data](ingest/sample-data.md).

34 changes: 28 additions & 6 deletions manage-data/ingest/tools.md
@@ -18,15 +18,37 @@ mapped_urls:

% Use migrated content from existing pages that map to this page:

-% - [ ] ./raw-migrated-files/cloud/cloud/ec-cloud-ingest-data.md
+% - [x] ./raw-migrated-files/cloud/cloud/ec-cloud-ingest-data.md
% Notes: These are resources to pull from, but this new "Ingest tools overiew" page will not be a replacement for any of these old AsciiDoc pages. File upload: https://www.elastic.co/guide/en/kibana/current/connect-to-elasticsearch.html#upload-data-kibana https://www.elastic.co/guide/en/serverless/current/elasticsearch-ingest-data-file-upload.html API: https://www.elastic.co/guide/en/kibana/current/connect-to-elasticsearch.html#_add_data_with_programming_languages https://www.elastic.co/guide/en/elasticsearch/reference/current/docs.html https://www.elastic.co/guide/en/serverless/current/elasticsearch-ingest-data-through-api.html OpenTelemetry: https://github.com/elastic/opentelemetry Fleet and Agent: https://www.elastic.co/guide/en/fleet/current/fleet-overview.html https://www.elastic.co/guide/en/serverless/current/fleet-and-elastic-agent.html Logstash: https://www.elastic.co/guide/en/logstash/current/introduction.html https://www.elastic.co/guide/en/serverless/current/elasticsearch-ingest-data-through-logstash.html https://www.elastic.co/guide/en/serverless/current/logstash-pipelines.html Beats: https://www.elastic.co/guide/en/beats/libbeat/current/beats-reference.html https://www.elastic.co/guide/en/serverless/current/elasticsearch-ingest-data-through-beats.html APM: /solutions/observability/apps/application-performance-monitoring-apm.md Application logging: https://www.elastic.co/guide/en/observability/current/application-logs.html ECS logging: https://www.elastic.co/guide/en/observability/current/logs-ecs-application.html Elastic serverless forwarder for AWS: https://www.elastic.co/guide/en/esf/current/aws-elastic-serverless-forwarder.html Integrations: https://www.elastic.co/guide/en/integrations/current/introduction.html Search connectors: https://www.elastic.co/guide/en/elasticsearch/reference/current/es-connectors.html https://www.elastic.co/guide/en/serverless/current/elasticsearch-ingest-data-through-integrations-connector-client.html Web crawler: https://github.com/elastic/crawler/tree/main/docs
-% - [ ] ./raw-migrated-files/ingest-docs/fleet/beats-agent-comparison.md
-% - [ ] ./raw-migrated-files/kibana/kibana/connect-to-elasticsearch.md
-% - [ ] https://www.elastic.co/customer-success/data-ingestion
-% - [ ] https://github.com/elastic/ingest-docs/pull/1373
+% - [This comparison page is being moved to the reference section, so I'm linking to that from the current page - Wajiha] ./raw-migrated-files/ingest-docs/fleet/beats-agent-comparison.md
+% - [x] ./raw-migrated-files/kibana/kibana/connect-to-elasticsearch.md
+% - [x] https://www.elastic.co/customer-success/data-ingestion
+% - [x] https://github.com/elastic/ingest-docs/pull/1373

% Internal links rely on the following IDs being on this page (e.g. as a heading ID, paragraph ID, etc):
% These IDs are from content that I'm not including on this current page. I've resolved them by changing the internal links to anchor links where needed. - Wajiha

$$$supported-outputs-beats-and-agent$$$

$$$additional-capabilities-beats-and-agent$$$

Depending on the type of data you want to ingest, you can choose from a number of methods and tools for your ingestion process. The table below provides more information about each tool. Refer to our [Ingestion](/manage-data/ingest.md) overview for guidelines to help you select the optimal tool for your use case.

<br>

| Tools | Usage | Links to more information |
| ------- | --------------- | ------------------------- |
| Integrations | Ingest data using a variety of Elastic integrations. | [Elastic Integrations](https://www.elastic.co/guide/en/integrations/current/index.html) |
| File upload | Upload data from a file and inspect it before importing it into {{es}}. | [Upload data files](/manage-data/ingest/upload-data-files.md) |
| APIs | Ingest data through code by using the APIs of one of the language clients or the {{es}} HTTP APIs. See the example after this table. | [Document APIs](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs.html) |
| OpenTelemetry | Collect and send your telemetry data to Elastic Observability. | [Elastic Distributions of OpenTelemetry](https://github.com/elastic/opentelemetry?tab=readme-ov-file#elastic-distributions-of-opentelemetry) |
| Fleet and Elastic Agent | Add monitoring for logs, metrics, and other types of data to a host using Elastic Agent, and centrally manage it using Fleet. | [Fleet and {{agent}} overview](https://www.elastic.co/guide/en/fleet/current/fleet-overview.html) <br> [{{fleet}} and {{agent}} restrictions (Serverless)](https://www.elastic.co/guide/en/fleet/current/fleet-agent-serverless-restrictions.html) <br> [{{beats}} and {{agent}} capabilities](https://www.elastic.co/guide/en/fleet/current/beats-agent-comparison.html) |
| {{elastic-defend}} | {{elastic-defend}} provides organizations with prevention, detection, and response capabilities with deep visibility for EPP, EDR, SIEM, and Security Analytics use cases across Windows, macOS, and Linux operating systems running on both traditional endpoints and public cloud environments. | [Configure endpoint protection with {{elastic-defend}}](/solutions/security/configure-elastic-defend.md) |
| {{ls}} | Dynamically unify data from a wide variety of data sources and normalize it into destinations of your choice with {{ls}}. | [Logstash (Serverless)](https://www.elastic.co/guide/en/serverless/current/elasticsearch-ingest-data-through-logstash.html) <br> [Logstash pipelines](/manage-data/ingest/transform-enrich/logstash-pipelines.md) |
| {{beats}} | Use {{beats}} data shippers to send operational data to {{es}} directly or through {{ls}}. | [{{beats}} (Serverless)](https://www.elastic.co/guide/en/serverless/current/elasticsearch-ingest-data-through-beats.html) <br> [What are {{beats}}?](https://www.elastic.co/guide/en/beats/libbeat/current/beats-reference.html) <br> [{{beats}} and {{agent}} capabilities](https://www.elastic.co/guide/en/fleet/current/beats-agent-comparison.html) |
| APM | Collect detailed performance information on response time for incoming requests, database queries, calls to caches, external HTTP requests, and more. | [Application performance monitoring (APM)](/solutions/observability/apps/application-performance-monitoring-apm.md) |
| Application logs | Ingest application logs using Filebeat, {{agent}}, or the APM agent, or reformat application logs into Elastic Common Schema (ECS) logs and then ingest them using Filebeat or {{agent}}. | [Stream application logs](/solutions/observability/logs/stream-application-logs.md) <br> [ECS formatted application logs](/solutions/observability/logs/ecs-formatted-application-logs.md) |
| Elastic Serverless Forwarder for AWS | Ship logs from your AWS environment to cloud-hosted or self-managed Elastic environments, or to {{ls}}. | [Elastic Serverless Forwarder](https://www.elastic.co/guide/en/esf/current/aws-elastic-serverless-forwarder.html) |
| Connectors | Use connectors to extract data from an original data source and sync it to an {{es}} index. | [Ingest content with Elastic connectors](https://www.elastic.co/guide/en/elasticsearch/reference/current/es-connectors.html) <br> [Connector clients](https://www.elastic.co/guide/en/serverless/current/elasticsearch-ingest-data-through-integrations-connector-client.html) |
| Web crawler | Discover, extract, and index searchable content from websites and knowledge bases using the web crawler. | [Elastic Open Web Crawler](https://github.com/elastic/crawler#readme) |
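
To illustrate the APIs row above, here is a minimal `_bulk` request; the index name `my-index` and the document fields are placeholders, not part of any specific integration:

```console
# Sketch only: index name and fields are placeholders
POST _bulk
{ "index": { "_index": "my-index", "_id": "1" } }
{ "message": "first document", "@timestamp": "2025-02-20T00:00:00Z" }
{ "index": { "_index": "my-index", "_id": "2" } }
{ "message": "second document", "@timestamp": "2025-02-20T00:01:00Z" }
```
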
19 changes: 0 additions & 19 deletions manage-data/ingest/tools/upload-data-files.md

This file was deleted.

63 changes: 63 additions & 0 deletions manage-data/ingest/upload-data-files.md
@@ -0,0 +1,63 @@
---
mapped_urls:
- https://www.elastic.co/guide/en/serverless/current/elasticsearch-ingest-data-file-upload.html
- https://www.elastic.co/guide/en/kibana/current/connect-to-elasticsearch.html#upload-data-kibana
---

# Upload data files [upload-data-kibana]

% What needs to be done: Align serverless/stateful

% Use migrated content from existing pages that map to this page:

% - [x] ./raw-migrated-files/docs-content/serverless/elasticsearch-ingest-data-file-upload.md
% - [x] ./raw-migrated-files/kibana/kibana/connect-to-elasticsearch.md

% Note from David: I've removed the ID $$$upload-data-kibana$$$ from manage-data/ingest.md as those links should instead point to this page. So, please ensure that the following ID is included on this page. I've added it beside the title.

You can upload files, view their fields and metrics, and optionally import them to {{es}} with the Data Visualizer.

To use the Data Visualizer, click **Upload a file** on the {{es}} **Getting Started** page, or go to the **Integrations** view and search for **Upload a file**. Either path opens the Data Visualizer UI.

:::{image} /images/serverless-file-uploader-UI.png
:alt: File upload UI
:class: screenshot
:::

Drag a file into the upload area or click **Select or drag and drop a file** to choose a file from your computer.

You can upload different file formats for analysis with the Data Visualizer:

File formats supported up to 500 MB:

* CSV
* TSV
* NDJSON
* Log files

File formats supported up to 60 MB:

* PDF
* Microsoft Office files (Word, Excel, PowerPoint)
* Plain Text (TXT)
* Rich Text (RTF)
* Open Document Format (ODF)

The Data Visualizer displays the first 1000 rows of the file. You can inspect the data and make any necessary changes before importing it. Click **Import** to continue the process.

This process creates an index and imports the data into {{es}}. Once your data is in {{es}}, you can start exploring it. See [Explore and analyze](/explore-analyze/index.md) for more information.
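
For example, assuming you accepted an index name of `my-uploaded-data` during import, a quick search confirms the documents are queryable:

```console
# Placeholder index name: use whatever name you chose during import
GET my-uploaded-data/_search
{
  "query": { "match_all": {} },
  "size": 5
}
```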

::::{important}
The upload feature is not intended for use as part of a repeated production process, but rather for the initial exploration of your data.

::::

## Required privileges

The {{stack-security-features}} provide roles and privileges that control which users can upload files. To upload a file in {{kib}} and import it into an {{es}} index, you’ll need:

* `manage_pipeline` or `manage_ingest_pipelines` cluster privilege
* `create`, `create_index`, `manage`, and `read` index privileges for the index
* `all` {{kib}} privileges for **Discover** and **Data Views Management**
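
As an illustration only (the role name and index pattern are placeholders), the {{es}} side of these privileges could be granted with a role like the following; the {{kib}} privileges for **Discover** and **Data Views Management** still need to be assigned separately:

```console
# Sketch only: role name and index name are placeholders
POST _security/role/file_upload_user
{
  "cluster": [ "manage_ingest_pipelines" ],
  "indices": [
    {
      "names": [ "my-uploaded-data" ],
      "privileges": [ "create", "create_index", "manage", "read" ]
    }
  ]
}
```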

You can manage your roles, privileges, and spaces in **{{stack-manage-app}}**.
3 changes: 1 addition & 2 deletions manage-data/toc.yml
@@ -91,6 +91,7 @@ toc:
- file: ingest/ingest-reference-architectures/agent-es-airgapped.md
- file: ingest/ingest-reference-architectures/agent-ls-airgapped.md
- file: ingest/sample-data.md
+- file: ingest/upload-data-files.md
- file: ingest/transform-enrich.md
children:
- file: ingest/transform-enrich/ingest-pipelines-serverless.md
@@ -106,8 +107,6 @@
- file: ingest/transform-enrich/example-enrich-data-by-matching-value-to-range.md
- file: ingest/transform-enrich/index-mapping-text-analysis.md
- file: ingest/tools.md
-children:
-  - file: ingest/tools/upload-data-files.md
- file: lifecycle.md
children:
- file: lifecycle/data-tiers.md
@@ -824,7 +824,7 @@ In this step, you load the data that you later use in the {{infer}} ingest pipel

Use the `msmarco-passagetest2019-top1000` data set, which is a subset of the MS MARCO Passage Ranking data set. It consists of 200 queries, each accompanied by a list of relevant text passages. All unique passages, along with their IDs, have been extracted from that data set and compiled into a [tsv file](https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv).

-Download the file and upload it to your cluster using the [Data Visualizer](../../../manage-data/ingest/tools/upload-data-files.md) in the {{ml-app}} UI. After your data is analyzed, click **Override settings**. Under **Edit field names**, assign `id` to the first column and `content` to the second. Click **Apply**, then **Import**. Name the index `test-data`, and click **Import**. After the upload is complete, you will see an index named `test-data` with 182,469 documents.
+Download the file and upload it to your cluster using the [Data Visualizer](../../../manage-data/ingest/upload-data-files.md) in the {{ml-app}} UI. After your data is analyzed, click **Override settings**. Under **Edit field names**, assign `id` to the first column and `content` to the second. Click **Apply**, then **Import**. Name the index `test-data`, and click **Import**. After the upload is complete, you will see an index named `test-data` with 182,469 documents.


## Ingest the data through the {{infer}} ingest pipeline [reindexing-data-infer]
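
A minimal sketch of this step, assuming an {{infer}} ingest pipeline named `my-infer-pipeline` has already been created and the destination index is `my-embeddings` (both names are placeholders):

```console
# Sketch only: destination index and pipeline names are assumptions
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data",
    "size": 50
  },
  "dest": {
    "index": "my-embeddings",
    "pipeline": "my-infer-pipeline"
  }
}
```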