From 8f3ff1a7cd99033e65ae2e83d0f7c3d595a0bb56 Mon Sep 17 00:00:00 2001
From: JBGruber
Date: Sat, 6 Mar 2021 17:42:57 +0100
Subject: [PATCH] Re-knit Readme

---
 README.Rmd |  2 +-
 README.md  | 79 +++++++++++++++++++++++-------------------------------
 2 files changed, 35 insertions(+), 46 deletions(-)

diff --git a/README.Rmd b/README.Rmd
index 1f2c443..b0305d0 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -15,7 +15,7 @@ knitr::opts_chunk$set(
 [![R-CMD-check](https://github.com/JBGruber/LexisNexisTools/workflows/R-CMD-check/badge.svg)](https://github.com/JBGruber/LexisNexisTools/actions)
 [![CRAN_Status_Badge](https://www.r-pkg.org/badges/version-ago/LexisNexisTools)](https://cran.r-project.org/package=LexisNexisTools)
 [![CRAN_Download_Badge](http://cranlogs.r-pkg.org/badges/grand-total/LexisNexisTools)](https://cran.r-project.org/package=LexisNexisTools)
-[![Coverage Status](https://codecov.io/gh/JBGruber/LexisNexisTools/branch/master/graph/badge.svg)](https://codecov.io/github/JBGruber/LexisNexisTools)
+[![Codecov test coverage](https://codecov.io/gh/JBGruber/LexisNexisTools/branch/master/graph/badge.svg)](https://codecov.io/gh/JBGruber/LexisNexisTools?branch=master)
 
 ## Motivation
 
diff --git a/README.md b/README.md
index f220a3b..820d9ba 100755
--- a/README.md
+++ b/README.md
@@ -4,8 +4,8 @@
 [![R-CMD-check](https://github.com/JBGruber/LexisNexisTools/workflows/R-CMD-check/badge.svg)](https://github.com/JBGruber/LexisNexisTools/actions)
 [![CRAN\_Status\_Badge](https://www.r-pkg.org/badges/version-ago/LexisNexisTools)](https://cran.r-project.org/package=LexisNexisTools)
 [![CRAN\_Download\_Badge](http://cranlogs.r-pkg.org/badges/grand-total/LexisNexisTools)](https://cran.r-project.org/package=LexisNexisTools)
-[![Coverage
-Status](https://codecov.io/gh/JBGruber/LexisNexisTools/branch/master/graph/badge.svg)](https://codecov.io/github/JBGruber/LexisNexisTools)
+[![Codecov test
+coverage](https://codecov.io/gh/JBGruber/LexisNexisTools/branch/master/graph/badge.svg)](https://codecov.io/gh/JBGruber/LexisNexisTools?branch=master) ## Motivation @@ -59,9 +59,9 @@ lnt_sample() ‘LexisNexis’ does not give its files proper names. The function `lnt_rename()` renames files to a standard format: For TXT files this format is “searchTerm\_startDate-endDate\_documentRange.txt” (e.g., -“Obama\_20091201-20100511\_1-500.txt”) (for other file types the -format is similar but depends on what information is available). Note, -that this will not work if your files lack a cover page with this +“Obama\_20091201-20100511\_1-500.txt”) (for other file types the format +is similar but depends on what information is available). Note, that +this will not work if your files lack a cover page with this information. Currently, it seems, like ‘LexisNexis’ only delivers those cover pages when you first create a link to your search (“link to this search” on the results page), follow this link, and then download the @@ -74,30 +74,24 @@ from a consistent naming scheme. There are three ways in which you can rename the files: - - Run lnt\_rename() directly in your working directory without the x +- Run lnt\_rename() directly in your working directory without the x argument, which will prompt an option to scan for TXT files in your current working directory: - - ``` r report <- lnt_rename() ``` - - Provide a folder path (and set `recursive = TRUE` if you want to +- Provide a folder path (and set `recursive = TRUE` if you want to scan for files recursively): - - ``` r report <- lnt_rename(x = getwd(), report = TRUE) ``` - - Provide a character object with file names. Use `list.files()` to +- Provide a character object with file names. Use `list.files()` to search for files in a certain path. 
- - ``` r my_files <- list.files(pattern = ".txt", path = getwd(), full.names = TRUE, recursive = TRUE, ignore.case = TRUE) @@ -107,7 +101,7 @@ report ``` | name\_orig | name\_new | status | type | -| :--------- | :-------------------------------------- | :------ | :--- | +|:-----------|:----------------------------------------|:--------|:-----| | sample.TXT | SampleFile\_20091201-20100511\_1-10.txt | renamed | txt | Using `list.files()` instead of the built-in mechanism allows you to @@ -133,33 +127,30 @@ some form but can be left to ‘auto’ to use ‘LexisNexis’ defaults in several languages. All keywords can be regular expressions and need to be in most cases: - - `start_keyword`: The English default is “\\d+ of \\d+ DOCUMENTS$” +- `start_keyword`: The English default is “\\d+ of \\d+ DOCUMENTS$” which stands for, for example, “1 of 112 DOCUMENTS”. It is used to split up the text in the TXT files into individual articles. You will not have to change anything here, except you work with documents in languages other than the currently supported. - - `end_keyword`: This keyword is used to remove unnecessary +- `end_keyword`: This keyword is used to remove unnecessary information at the end of an article. Usually, this is “^LANGUAGE:”. Where the keyword isn’t found, the additional information ends up in the article text. - - `length_keyword`: This keyword, which is usually just “^LENGTH:” (or +- `length_keyword`: This keyword, which is usually just “^LENGTH:” (or its equivalent in other languages) finds the information about the length of an article. However, since this is always the last line of the metadata, it is used to separate metadata and article text. There seems to be only one type of cases where this information is missing: if the article consists only of a graphic (which - ‘LexisNexis’ does not retrieve). The final output from - `lnt_read()` has a column named `Graphic`, which indicates if this - keyword was missing. 
The article text then contains all metadata as - well. In these cases, you should remove the whole article after - inspecting it. (Use - `View(LNToutput@articles$Article[LNToutput@meta$Graphic])` to view - these articles in a spreadsheet like viewer.) + ‘LexisNexis’ does not retrieve). The final output from `lnt_read()` + has a column named `Graphic`, which indicates if this keyword was + missing. The article text then contains all metadata as well. In + these cases, you should remove the whole article after inspecting + it. (Use `View(LNToutput@articles$Article[LNToutput@meta$Graphic])` + to view these articles in a spreadsheet like viewer.)

- -

To use the function, you can again provide either file name(s), folder @@ -207,11 +198,11 @@ paragraphs_df <- LNToutput@paragraphs head(meta_df, n = 3) ``` -| ID | Source\_File | Newspaper | Date | Length | Section | Author | Edition | Headline | Graphic | -| -: | :-------------------------------------- | :---------------- | :--------- | :-------- | :-------------- | :-------------- | :------------------ | :------------------------- | :------ | -| 1 | SampleFile\_20091201-20100511\_1-10.txt | Guardian.com | 2010-01-11 | 355 words | NA | Andrew Sparrow | NA | Lorem ipsum dolor sit amet | FALSE | -| 2 | SampleFile\_20091201-20100511\_1-10.txt | Guardian | 2010-01-11 | 927 words | NA | Simon Tisdall | NA | Lorem ipsum dolor sit amet | FALSE | -| 3 | SampleFile\_20091201-20100511\_1-10.txt | The Sun (England) | 2010-01-11 | 677 words | FEATURES; Pg. 6 | TREVOR Kavanagh | Edition 1; Scotland | Lorem ipsum dolor sit amet | FALSE | +| ID | Source\_File | Newspaper | Date | Length | Section | Author | Edition | Headline | Graphic | +|----:|:----------------------------------------|:------------------|:-----------|:----------|:----------------|:----------------|:--------------------|:---------------------------|:--------| +| 1 | SampleFile\_20091201-20100511\_1-10.txt | Guardian.com | 2010-01-11 | 355 words | NA | Andrew Sparrow | NA | Lorem ipsum dolor sit amet | FALSE | +| 2 | SampleFile\_20091201-20100511\_1-10.txt | Guardian | 2010-01-11 | 927 words | NA | Simon Tisdall | NA | Lorem ipsum dolor sit amet | FALSE | +| 3 | SampleFile\_20091201-20100511\_1-10.txt | The Sun (England) | 2010-01-11 | 677 words | FEATURES; Pg. 6 | TREVOR Kavanagh | Edition 1; Scotland | Lorem ipsum dolor sit amet | FALSE | If you want to keep only one data.frame including metadata and text data you can easily do so: @@ -297,9 +288,7 @@ lnt_diff(duplicates_df, min = 0, max = Inf) ```

- diff -

By default, 25 randomly selected articles are displayed one after @@ -323,9 +312,9 @@ LNToutput[1, ] #> 1 articles #> 5 paragraphs #> # A tibble: 1 x 10 -#> ID Source_File Newspaper Date Length Section Author Edition Headline -#> -#> 1 1 SampleFile… Guardian… 2010-01-11 355 w… Andre… Lorem i… +#> ID Source_File Newspaper Date Length Section Author Edition Headline +#> +#> 1 1 SampleFile_… Guardian… 2010-01-11 355 w… Andre… Lorem i… #> # … with 1 more variable: Graphic #> # A tibble: 1 x 2 #> ID Article @@ -358,11 +347,11 @@ paragraphs_df <- LNToutput@paragraphs head(meta_df, n = 3) ``` -| ID | Source\_File | Newspaper | Date | Length | Section | Author | Edition | Headline | Graphic | -| -: | :-------------------------------------- | :---------------- | :--------- | :-------- | :-------------- | :-------------- | :------------------ | :------------------------- | :------ | -| 1 | SampleFile\_20091201-20100511\_1-10.txt | Guardian.com | 2010-01-11 | 355 words | NA | Andrew Sparrow | NA | Lorem ipsum dolor sit amet | FALSE | -| 2 | SampleFile\_20091201-20100511\_1-10.txt | Guardian | 2010-01-11 | 927 words | NA | Simon Tisdall | NA | Lorem ipsum dolor sit amet | FALSE | -| 3 | SampleFile\_20091201-20100511\_1-10.txt | The Sun (England) | 2010-01-11 | 677 words | FEATURES; Pg. 
6 | TREVOR Kavanagh | Edition 1; Scotland | Lorem ipsum dolor sit amet | FALSE | +| ID | Source\_File | Newspaper | Date | Length | Section | Author | Edition | Headline | Graphic | +|----:|:----------------------------------------|:------------------|:-----------|:----------|:----------------|:----------------|:--------------------|:---------------------------|:--------| +| 1 | SampleFile\_20091201-20100511\_1-10.txt | Guardian.com | 2010-01-11 | 355 words | NA | Andrew Sparrow | NA | Lorem ipsum dolor sit amet | FALSE | +| 2 | SampleFile\_20091201-20100511\_1-10.txt | Guardian | 2010-01-11 | 927 words | NA | Simon Tisdall | NA | Lorem ipsum dolor sit amet | FALSE | +| 3 | SampleFile\_20091201-20100511\_1-10.txt | The Sun (England) | 2010-01-11 | 677 words | FEATURES; Pg. 6 | TREVOR Kavanagh | Edition 1; Scotland | Lorem ipsum dolor sit amet | FALSE | ### Lookup Keywords @@ -411,9 +400,9 @@ LNToutput #> 1 articles #> 7 paragraphs #> # A tibble: 1 x 11 -#> ID Source_File Newspaper Date Length Section Author Edition Headline -#> -#> 1 9 SampleFile… Sunday M… 2010-01-10 446 w… NEWS; … Ross … 3 Star… R (prog… +#> ID Source_File Newspaper Date Length Section Author Edition Headline +#> +#> 1 9 SampleFile_… Sunday M… 2010-01-10 446 w… NEWS; … Ross … 3 Star… R (prog… #> # … with 2 more variables: Graphic , stats #> # A tibble: 1 x 2 #> ID Article