Scio quickstart

This repository contains a sample pipeline for getting started with Scio, a Scala API for developing Apache Beam pipelines.

Fork this repository so you can commit your changes in your own repository.

Pipeline

The goal of this example is to count the words in Don Quixote, the famous novel by Miguel de Cervantes. The novel has several characters: Sancho, the buddy of Don Quixote; Dulcinea, the significant other of Don Quixote; Rocinante, the fearful horse of Don Quixote, etc.

The pipeline not only counts the words; it also sorts them by number of occurrences and answers an existential question: who is mentioned more often in the novel, Sancho or Dulcinea?

Let's find out with the help of Scio.
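Before turning to Scio, the logic described above (tokenize, count, sort by frequency, keep the top N) can be sketched with plain Scala collections. This is only an illustration, not the pipeline from this repository: the names WordCountSketch, tokenize, countWords and topWords are hypothetical, and so is the choice to lowercase the text and split on non-letter characters.

```scala
object WordCountSketch {
  // Lowercase the text and split on anything that is not a letter,
  // so that "Sancho," and "sancho" count as the same word.
  // (Assumption: this tokenization rule is illustrative, not the repository's.)
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^\\p{L}]+").filter(_.nonEmpty).toSeq

  // Count how many times each word occurs.
  def countWords(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (word, group) => word -> group.size }

  // Keep the n most frequent words, most frequent first.
  // Ties are broken alphabetically so the result is deterministic.
  def topWords(counts: Map[String, Int], n: Int): Seq[(String, Int)] =
    counts.toSeq.sortBy { case (word, count) => (-count, word) }.take(n)

  def main(args: Array[String]): Unit = {
    val text = "Sancho and Dulcinea; Sancho and Rocinante. Sancho!"
    val counts = countWords(tokenize(text))
    topWords(counts, 3).foreach { case (word, count) => println(s"$word: $count") }
  }
}
```

In a Scio pipeline the same steps would run over a distributed collection instead of an in-memory Seq, but the shape of the computation is the same.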

Compile

The first step in solving this mysterious question is to compile the code. For that, you need to have sbt installed:

https://www.scala-sbt.org/

Once sbt is installed, you can run:

sbt compile: compiles the code (for instance, while you are developing the pipeline)
sbt stage: produces a runnable package

Input data

In the data directory you will find two files:

sample.txt, a small extract of the novel, which you can use for tests while you are developing the pipeline
el_quijote.txt, the full novel, which you need to settle the important question about Sancho and Dulcinea

Running the example

Once you have run sbt stage, there will be a script in the directory target/universal/stage/bin. You can use that script to run the pipeline.

For instance, to find the top 10 words in the sample data:

./target/universal/stage/bin/scio-quickstart --input-file=./data/sample.txt --output-file=tmp --num-words=10

After that, you should find a file with a name like part-00000-of-00001.txt in the tmp subdirectory.

To run with the full data and top 100 words:

./target/universal/stage/bin/scio-quickstart --input-file=./data/el_quijote.txt --output-file=tmp --num-words=100

Search for sancho and dulcinea in the output to solve this burning question.
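If you prefer to do the search from Scala rather than by hand, a minimal sketch follows. It assumes only that the output consists of text files whose names start with part- inside the tmp directory, as in the example above; the object name SearchOutput and the helper matchingLines are hypothetical. It makes no assumption about the line format and simply prints every line that mentions either name.

```scala
import java.io.File
import scala.io.Source

object SearchOutput {
  // Keep the lines that mention any of the given words (case-insensitive).
  // This is a substring match, so "sancho" also matches a line with "sanchos".
  def matchingLines(lines: Seq[String], words: Set[String]): Seq[String] =
    lines.filter(line => words.exists(w => line.toLowerCase.contains(w.toLowerCase)))

  def main(args: Array[String]): Unit = {
    // Output directory; defaults to "tmp" as in the commands above.
    val dir = new File(args.headOption.getOrElse("tmp"))

    // listFiles() returns null if the directory does not exist.
    val parts = Option(dir.listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.getName.startsWith("part-"))
      .sortBy(_.getName)

    // Read all lines from all output shards.
    val lines = parts.toSeq.flatMap { f =>
      val src = Source.fromFile(f, "UTF-8")
      try src.getLines().toList finally src.close()
    }

    matchingLines(lines, Set("sancho", "dulcinea")).foreach(println)
  }
}
```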

Development

The pipeline is initially empty. Your task, should you accept it, is to create the pipeline that is required to solve the Sancho vs. Dulcinea question.
