Addresses #9 - improves test coverage and documentation, also bugs
Signed-off-by: Tim Bray <[email protected]>
timbray committed Apr 7, 2024
1 parent 4eba337 commit f0135c9
Showing 23 changed files with 50,820 additions and 408 deletions.
33 changes: 33 additions & 0 deletions .github/workflows/go.yaml
@@ -22,3 +22,36 @@ jobs:

       - name: Test
         run: make test
+
+  lint:
+    name: Code Linting
+    strategy:
+      matrix:
+        go-version: ["1.19"]
+        platform: ["ubuntu-latest"]
+    runs-on: ${{ matrix.platform }}
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v2
+        with:
+          fetch-depth: 1
+
+      - name: Set up Go ${{ matrix.go-version }}
+        uses: actions/setup-go@v2
+        with:
+          go-version: ${{ matrix.go-version }}
+        id: go
+
+      - name: Restore Go cache
+        uses: actions/cache@v2
+        with:
+          path: |
+            ~/.cache/go-build
+            ~/go/pkg/mod
+          key: ${{ runner.os }}-go-${{ matrix.go-version }}-${{ hashFiles('**/go.sum', 'testdata/**') }}
+          restore-keys: |
+            ${{ runner.os }}-go-${{ matrix.go-version }}-
+      - name: Run golangci-lint
+        uses: golangci/golangci-lint-action@08e2f20817b15149a52b5b3ebe7de50aff2ba8c5
19 changes: 14 additions & 5 deletions Makefile
@@ -1,13 +1,22 @@
 .PHONY: test

-all: test build linux
+all: test build

 test: */*.go
 	go test -v ./... && go vet ./...

-build: */*.go
-	go build -o bin/tf .
+build: bin/macos-arm/tf bin/macos-x86/tf bin/linux-x86/tf bin/linux-arm/tf
+
+bin/macos-arm/tf: */*.go
+	GOOS=darwin GOARCH=arm64 go build -o bin/macos-arm/tf
+
+bin/macos-x86/tf: */*.go
+	GOOS=darwin GOARCH=amd64 go build -o bin/macos-x86/tf
+
+bin/linux-x86/tf: */*.go
+	GOOS=linux GOARCH=amd64 go build -o bin/linux-x86/tf
+
+bin/linux-arm/tf: */*.go
+	GOOS=linux GOARCH=arm64 go build -o bin/linux-arm/tf

-linux: */*.go
-	GOOS=linux go build -o bin/ltf .

108 changes: 46 additions & 62 deletions README.md
@@ -1,111 +1,95 @@
# topfew
A program that finds records in which a
certain field or combination of fields occurs
most frequently
A program that finds and prints out the top few records in which a certain field or combination of fields occurs most frequently.

## Usage

```shell
tf
-n, --n [number of lines]
-f, --fields [fieldlist]
-h, -help, --help
-g, --grep [regexp]
-v, --vgrep [regexp]
-s, --sed [regexp] [replacement]
-w, --width [number of file segments]
-sample
[filename]
-n, --number (output line count) [default is 10]
-f, --fields (field list) [default is the whole record]
-g, --grep (regexp) [may repeat, default is accept all]
-v, --vgrep (regexp) [may repeat, default is reject none]
-s, --sed (regexp) (replacement) [may repeat, default is no changes]
-w, --width (segment count) [default is result of runtime.NumCPU()]
--sample
-h, -help, --help
filename [default is stdin]

All the arguments are optional; if none are provided, tf will read records
from the standard input and list the 10 which occur most often.
```
## Options
`-n integer`, `--number integer` How many of the highest‐occurrence‐count lines to print out. The
default value is 10.
`-n integer`, `--number integer` How many of the highest‐occurrence‐count lines to print out.
The default value is 10.
`-f fieldlist, --fields fieldlist` Specifies which fields should be extracted from incoming records
and used in computing occurrence counts. The fieldlist must be a
comma‐separated list of integers identifying field numbers,
which start at one, for example 3 and 2,5,6. The fields
must be provided in order, so 3,1,7 is an error.
`-f fieldlist, --fields fieldlist` Specifies which fields should be extracted from incoming records and used in computing occurrence counts.
The fieldlist must be a comma‐separated list of integers identifying field numbers, which start at one, for example 3 and 2,5,6.
The fields must be provided in order, so 3,1,7 is an error.
If no fieldlist is provided, **tf** treats the whole input record as a single field.
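The fieldlist rules above can be sketched in Go. This is a minimal illustration, not **tf**'s actual parser: `parseFieldList` is a hypothetical helper, and treating a repeated field number (such as 2,2) as an error is an assumption, since the text only says the fields must be in order.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFieldList is a hypothetical helper, not tf's actual code. It turns a
// comma-separated list such as "2,5,6" into field numbers, rejecting numbers
// below one and lists that are not in strictly ascending order (the "strictly"
// part is an assumption of this sketch).
func parseFieldList(spec string) ([]int, error) {
	parts := strings.Split(spec, ",")
	fields := make([]int, 0, len(parts))
	for _, p := range parts {
		n, err := strconv.Atoi(strings.TrimSpace(p))
		if err != nil {
			return nil, fmt.Errorf("bad field number %q: %w", p, err)
		}
		if n < 1 {
			return nil, fmt.Errorf("field numbers start at one, got %d", n)
		}
		if len(fields) > 0 && n <= fields[len(fields)-1] {
			return nil, fmt.Errorf("fields must be in order: %d follows %d", n, fields[len(fields)-1])
		}
		fields = append(fields, n)
	}
	return fields, nil
}

func main() {
	// "3" and "2,5,6" are accepted; "3,1,7" is rejected as out of order.
	for _, spec := range []string{"3", "2,5,6", "3,1,7"} {
		fields, err := parseFieldList(spec)
		fmt.Println(spec, fields, err)
	}
}
```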
`-g, regexp`, `--grep regexp`
`-g regexp`, `--grep regexp`
The initial **g** suggests `grep`. These options apply the provided
regular expression to, respectively, each record as it is read
and each field‐set as it is extracted, and if the regexp does
not match the record or field, cause tf to bypass the record.
The initial **g** suggests `grep`.
This option applies the provided regular expression to each record as it is read; if the regexp does not match the record, **tf** bypasses it.
These options can be provided multiple times; the provided regu‐
lar expressions will be applied in the order they appear on the
command line.
This option can be provided multiple times; the provided regular expressions will be applied in the order they appear on the command line.
`-v regexp`, `--vgrep regexp`
The initial **v** suggests "grep ‐v". These operations are the in‐
verse of `‐grecord` and `‐gfield`, rejecting records and extracted
fields that match the provided regular expression. As with
those operations, these can be provided multiple times.
The initial **v** suggests `grep -v`. This operation is the inverse of `-g` and `--grep`, rejecting records that match the provided regular expression.
As with `--grep`, it can be provided multiple times.
`-s regexp replacement`, `--sed regexp replacement`
As its name suggests, applies sed‐style editing by replacing any
text that matches the provided regexp with the provided replace‐
ment. It works on the fields in the fieldlist after they have
been extracted from the record.
As its name suggests, applies sed‐style editing by replacing any text that matches the provided regexp with the provided replacement.
It works on the fields in the fieldlist after they have been extracted from the record.
If ()‐enclosed capturing groups appear in the regexp, they may
be referred to as **$1**, **$2**, and so on in, the replacement.
If ()‐enclosed capturing groups appear in the regexp, they may be referred to as **$1**, **$2**, and so on, in the replacement.
This option can be provided many times, and the replacement op‐
erations are performed in the order they appear on the command
line.
This option can be provided many times, and the replacement operations are performed in the order they appear on the command line.
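As a rough illustration of this kind of editing, the Go standard library's `regexp` package supports `$1`-style references in replacement text. The sketch below assumes the edits behave like `ReplaceAllString` applied in order; `sedEdit` is a hypothetical helper and the sample field values are made up.

```go
package main

import (
	"fmt"
	"regexp"
)

// sedEdit is a hypothetical helper, not tf's actual code. It applies a list
// of (regexp, replacement) pairs to a field, in the order given.
func sedEdit(field string, edits [][2]string) (string, error) {
	for _, e := range edits {
		re, err := regexp.Compile(e[0])
		if err != nil {
			return "", fmt.Errorf("bad regexp %q: %w", e[0], err)
		}
		// In the replacement, $1, $2, ... refer to capturing groups.
		field = re.ReplaceAllString(field, e[1])
	}
	return field, nil
}

func main() {
	// Reduce an Apache-style timestamp field to hour:minute with three edits.
	hourMinute, err := sedEdit("[12/May/2020:14:25:07", [][2]string{
		{`\[`, ""},
		{`^[^:]*:`, ""},
		{`:..$`, ""},
	})
	fmt.Println(hourMinute, err)

	// Keep only the first two octets of an address using capturing groups.
	prefix, err := sedEdit("192.168.3.4", [][2]string{
		{`^(\d+)\.(\d+)\..*$`, "$1.$2"},
	})
	fmt.Println(prefix, err)
}
```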
`--sample`
It can be tricky to get the regular expressions in the `−g`,
`−v`, and `−s` options right. Specifying
`-−sample` causes **tf** to print lines to the standard output that
display the filtering and field‐editing logic. It can only be
used when processing standard input, not a file.
It can be tricky to get the regular expressions in the `-g`, `-v`, and `-s` options right.
Specifying `--sample` causes **tf** to print lines to the standard output that display the filtering and field‐editing logic.
It can only be used when processing standard input, not a file.
`-w integer`, `--width integer`
If a file name is specified then **tf**, rather than reading it from
end to end, will divide it into segements and process it in multiple
parallel threads. The optimal number of threads depends in a
complicated way on how many cores your CPU has what kind of cores
they are, and the storage architecture.
If a file name is specified then **tf**, rather than reading it from end to end, will divide it into segments and process it in multiple parallel threads.
The optimal number of threads depends in a complicated way on how many cores your CPU has, what kind of cores they are, and the storage architecture.
The default is the result of the Go `runtime.NumCPU()` calls and
often produces good results.
The default is the result of the Go `runtime.NumCPU()` call, which often produces good results.
`-h`, `-help`, `--help`
Describes the function and options of tf.
Describes the function and options of **tf**.
## Examples
To find the IP address that most commonly hits your
web site, given an Apache logfile named `access_log`
To find the IP address that most commonly hits your web site, given an Apache logfile named `access_log`.
`tf -fields 1 access_log`
`tf --fields 1 access_log`
The same effect could be achieved with
`awk '{print $1}' access_log | sort | uniq -c | sort -rn | head`
But tf is usualy much faster.
But **tf** is usually much faster.
Do the same, but exclude high-traffic bots (omiting `access_log`)
Do the same, but exclude high-traffic bots (omitting the filename).
`tf -fields 1 -vrecord googlebot -vrecord bingbot`
`tf --fields 1 --vgrep googlebot --vgrep bingbot`
Most popular IP addresses from May 2020.
`tf -fields 1 -grecord '\[../May/2020' `
`tf --fields 1 --grep '\[../May/2020'`
Most popular hour/minute of the day for retrievals
Most popular hour/minute of the day for retrievals.
`tf -fields 4 -sed "\\[" "" -sed '^[^:]*:' '' -sed ':..$' '' `
`tf --fields 4 --sed "\\[" "" --sed '^[^:]*:' '' --sed ':..$' ''`
## Credits
Tim Bray created version 0.1 of Topfew, and the path toward 1.0 was based chiefly on ideas stolen from Dirkjan Ochtman and contributed by Simon Fell.