Merge branch 'develop' into feature/corpus-model-validation
lukavdplas authored Sep 26, 2023
2 parents ceba0fb + 4aa606a commit e20b119
Showing 50 changed files with 528 additions and 338 deletions.
3 changes: 1 addition & 2 deletions backend/addcorpus/constants.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,8 @@
 from enum import Enum

 CATEGORIES = [
-    ('newspaper', 'Newspapers'),
     ('parliament', 'Parliamentary debates'),
-    ('periodical', 'Periodicals'),
+    ('periodical', 'Newspapers and other periodicals'),
     ('finance', 'Financial reports'),
     ('ruling', 'Court rulings'),
     ('review', 'Online reviews'),
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Generated by Django 4.1.9 on 2023-09-21 14:16

from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ('addcorpus', '0003_add_corpusconfiguration'),
    ]

    operations = [
        migrations.AlterField(
            model_name='corpusconfiguration',
            name='category',
            field=models.CharField(choices=[('parliament', 'Parliamentary debates'), ('periodical', 'Newspapers and other periodicals'), ('finance', 'Financial reports'), ('ruling', 'Court rulings'), ('review', 'Online reviews'), ('inscription', 'Funerary inscriptions'), ('oration', 'Orations'), ('book', 'Books')], help_text='category/medium of documents in this dataset', max_length=64),
        ),
    ]
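Note on the migration above: changing `choices` in an `AlterField` does not rewrite values that are already stored, so any stored configuration still carrying the retired `'newspaper'` key would fall outside the allowed set. The sketch below shows one way such values could be normalised before validation. It is plain Python with no Django dependency, and `LEGACY_CATEGORIES` and `normalize_category` are hypothetical names that do not appear in this commit.

```python
# Category choices as they stand after this commit (copied from the migration).
CATEGORIES = [
    ('parliament', 'Parliamentary debates'),
    ('periodical', 'Newspapers and other periodicals'),
    ('finance', 'Financial reports'),
    ('ruling', 'Court rulings'),
    ('review', 'Online reviews'),
    ('inscription', 'Funerary inscriptions'),
    ('oration', 'Orations'),
    ('book', 'Books'),
]

# Hypothetical mapping from retired keys to their replacements.
LEGACY_CATEGORIES = {'newspaper': 'periodical'}

VALID_KEYS = {key for key, _label in CATEGORIES}


def normalize_category(value: str) -> str:
    """Return a valid category key, translating retired keys where possible."""
    value = LEGACY_CATEGORIES.get(value, value)
    if value not in VALID_KEYS:
        raise ValueError(f'unknown corpus category: {value!r}')
    return value
```

For example, `normalize_category('newspaper')` returns `'periodical'`, while an unrecognised key such as `'magazine'` raises `ValueError`.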
2 changes: 1 addition & 1 deletion backend/corpora/dutchnewspapers/dutchnewspapers_public.py
@@ -36,7 +36,7 @@ class DutchNewspapersPublic(XMLCorpusDefinition):
     es_index = getattr(settings, 'DUTCHNEWSPAPERS_ES_INDEX', 'dutchnewspapers-public')
     image = 'dutchnewspapers.jpg'
     languages = ['nl']
-    category = 'newspaper'
+    category = 'periodical'

     @property
     def es_settings(self):
38 changes: 38 additions & 0 deletions backend/corpora/ecco/description/ecco.md
@@ -0,0 +1,38 @@
*Eighteenth Century Collections Online (ECCO)* is a fully text-searchable corpus of books, pamphlets and broadsides in all subjects printed between 1701 and 1800. It currently contains over 135,000 titles amounting to over 26 million fully searchable pages. *ECCO* is a digitization of the eighteenth-century section of the works catalogued in the *English Short-title Catalogue (ESTC)*.

Most of these works were printed in England, Scotland, Ireland and the United States, but it also contains works printed in territories under British colonial rule as well as from countries across Europe and Asia.

The corpus includes everything from six-penny broadsheets, pamphlets, books, government documents and more, written by or about people of all professions and classes.

### Subjects

- Multidisciplinary
- Eighteenth-century knowledge, thought, beliefs, events
- Age of Enlightenment
- Histories
- Poetry
- Novels
- Plays
- Law books
- Biographies
- Science
- Philosophy
- Dictionaries
- Theology/ Religion
- Diaries
- Almanacs
- … and many more

### Read more

Additional information can be found in the links below.

- [Access through publisher website (requires Utrecht University login)](https://go-gale-com.proxy.library.uu.nl/ps/start.do?p=ECCO&u=utrecht)
- [About this archive (publisher website; requires Utrecht University login)](https://go-gale-com.proxy.library.uu.nl/ps/helpCenter?userGroupName=utrecht&inPS=true&nspage=true&prodId=ECCO&docId=EFZIPA587871271)
- [Sample topics and searches (publisher website; requires Utrecht University login)](https://go-gale-com.proxy.library.uu.nl/ps/helpCenter?userGroupName=utrecht&inPS=true&nspage=true&prodId=ECCO&docId=OAWADC058207024&title=Sample%20Topics%20and%20Searches)

### Availability

*ECCO* is published by [Gale](https://en.wikipedia.org/wiki/Gale_(publisher)) and is only available to members of Utrecht University.

*Note:* Only the *ECCO Part I* is available on I-analyzer.
1 change: 1 addition & 0 deletions backend/corpora/ecco/ecco.py
@@ -26,6 +26,7 @@
 class Ecco(XMLCorpusDefinition):
     title = "Eighteenth Century Collections Online"
     description = "Digital collection of books published in Great Britain during the 18th century."
+    description_page = 'ecco.md'
     min_date = datetime(year=1700, month=1, day=1)
     max_date = datetime(year=1800, month=12, day=31)
30 changes: 30 additions & 0 deletions backend/corpora/guardianobserver/description/guardianobserver.md
@@ -0,0 +1,30 @@

This corpus contains articles from *The Guardian* and *The Observer.*

### The Guardian

*The Guardian* is a British daily newspaper, originally founded in 1821 as *The Manchester Guardian*. It is a sister newspaper to both *The Observer* and *The Guardian Weekly*. It is considered a “newspaper of record” and is currently one of the most widely read in the UK and a respected newspaper in the world.

Political alignment: Centre-left

### Observer

*The Observer* is a British newspaper published weekly on Sundays. It is the world's oldest Sunday newspaper and is a sister paper to both *The Guardian* and *The Guardian Weekly*.

Political alignment: Centre-left; British republicanism

### Subjects

- Historical local, regional and national news
- Multidisciplinary

### Read more

- [The Guardian (Wikipedia)](https://en.wikipedia.org/wiki/The_Guardian)
- [Official website of The Guardian](https://www.theguardian.com/international)
- [The Observer (Wikipedia)](https://en.wikipedia.org/wiki/The_Observer)
- [Access through publisher website (requires Utrecht University login)](https://www.proquest.com/hnpguardianobserver/index?parentSessionId=SBW10zSG6gyVTa17wSPUIoNhfaXQZBxx2UvOA9%2FiYto%3D&accountid=14772)

### Availability

The Guardian/Observer corpus is published by [ProQuest](https://en.wikipedia.org/wiki/ProQuest) and is only available to members of Utrecht University.
3 changes: 2 additions & 1 deletion backend/corpora/guardianobserver/guardianobserver.py
@@ -34,14 +34,15 @@
 class GuardianObserver(XMLCorpusDefinition):
     title = "Guardian-Observer"
     description = "Newspaper archive, 1791-2003"
+    description_page = 'guardianobserver.md'
     min_date = datetime(year=1791, month=1, day=1)
     max_date = datetime(year=2003, month=12, day=31)
     data_directory = settings.GO_DATA
     es_index = getattr(settings, 'GO_ES_INDEX', 'guardianobserver')
     image = 'guardianobserver.jpg'
     scan_image_type = getattr(settings, 'GO_SCAN_IMAGE_TYPE', 'application/pdf')
     languages = ['en']
-    category = 'newspaper'
+    category = 'periodical'

     @property
     def es_settings(self):
Original file line number Diff line number Diff line change
@@ -1,10 +1,26 @@
### 19th Century UK Periodicals: new readerships
The *Nineteenth Century UK Periodicals* series covers the events, lives, values, and themes that shaped the nineteenth century world.

The 19th century was a time of revolutionary change and expansion. Britain was one of the
world’s first industrial, urban superpowers and developed a press to feed the demands
of its increasingly literate population: 19th Century UK Periodicals covers the events, lives,
values and themes that shaped the nineteenth-century world.
The collection is comprised of material published primarily in England, but also includes titles from Australia, Canada, India, South Africa, and many more.

The collection was predominantly sourced from two major libraries – the British Library and the National Library of Scotland.

### Subjects

- Empire and Colonialism
- Science and Industry
- Cities and Society
- Sport and Leisure
- Politics
- Daily Life
- Feminism
- Art and Culture
- Philosophy
- Literature
- Parenting
- Medicine
- … and many more

### Titles
The corpus includes the following 91 titles:
- Alexandra Magazine and Womans Social and Industrial Advocate
- Atalanta
@@ -96,4 +112,14 @@ The corpus includes the following 91 titles:
 - Walters Theatrical and Sporting Directory and Book of Reference
 - Womans Advocate
 - Women and Work: A Weekly Industrial Educational and Household Register for Women
-- Womens Penny Paper.
+- Womens Penny Paper.
+
+### Read more
+
+- [Access through publisher website (requires Utrecht University login)](https://go-gale-com.proxy.library.uu.nl/ps/start.do?p=NCUK&u=utrecht)
+- [About this archive (publisher website; requires Utrecht University login)](https://go-gale-com.proxy.library.uu.nl/ps/helpCenter?userGroupName=utrecht&inPS=true&nspage=true&prodId=NCUK&docId=DWSDAY911647535)
+- [Sample topics and searches (publisher website; requires Utrecht University login)](https://go-gale-com.proxy.library.uu.nl/ps/helpCenter?userGroupName=utrecht&inPS=true&nspage=true&prodId=NCUK&docId=KEECCH350737398&title=Sample%20Topics%20and%20Searches)
+
+### Availability
+
+This corpus is published by [Gale](https://en.wikipedia.org/wiki/Gale_(publisher)) and is only available to members of Utrecht University.
36 changes: 32 additions & 4 deletions backend/corpora/times/description/times.md
@@ -1,6 +1,34 @@
-### The Times Digtial Archive 1785-2012
+*The Times* is a British daily national newspaper, originally founded in 1785 as *The Daily Universal Register*. *The Times* is the oldest daily newspaper in continuous publication and remains one of the most widely read and respected newspapers in the world. It is a sister newspaper to *The Sunday Times*.
+
+Political alignment: Conservative; Centre-right
+
+This corpus contains a full-text version of 200 years of *The Times*, a critical source for studying a range of subjects.
 
-This corpus contains a full-text version of 200 years of The Times, a critical source for studying a range of subjects.
 All issues of this period are present, with the following exceptions:
-- Issues of march 1785: they are missing in the publisher's archive.
-- Issues in date range 01/01/1979 - 31/10/1979: during this period, a major general strike occured and no newspaper editions were published.
+- Issues of March 1785: they are missing from the publisher's archive.
+- Issues in date range 01/01/1979 - 31/10/1979: during this period, a major general strike occurred, and no newspaper editions were published
+
+### Subjects
+
+- Historical local, regional and national news
+- Multidisciplinary
+- Business
+- Humanities
+- Political Science
+- Philosophy
+- Major international historical events
+
+### Read more
+
+- [The Times (Wikipedia)](https://en.wikipedia.org/wiki/The_Times)
+- [Access through publisher website (requires Utrecht University login)](https://go-gale-com.proxy.library.uu.nl/ps/start.do?p=TTDA&u=utrecht)
+- [About this archive (publisher website; requires Utrecht University login)](https://go-gale-com.proxy.library.uu.nl/ps/helpCenter?userGroupName=utrecht&inPS=true&nspage=true&prodId=TTDA&docId=QCOGMG579883681)
+- [Sample topics and searches](https://go-gale-com.proxy.library.uu.nl/ps/helpCenter?userGroupName=utrecht&inPS=true&nspage=true&prodId=TTDA&docId=GCANVE436736839&title=Sample%20Topics%20and%20Searches)
+
+### Availability
+
+This corpus is published by [Gale](https://en.wikipedia.org/wiki/Gale_(publisher)) and is only available to members of Utrecht University.
+
+### Image source
+
+Corpus image from [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Twice_round_the_clock;_or,_The_hours_of_the_day_and_night_in_London_(1859)_(14776691334).jpg)
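Note on the `times.md` hunk above: the two gaps it lists can be encoded as date ranges, which may help when interpreting empty search results or timelines. This is a minimal sketch that takes the bounds exactly as stated in the description and treats them as inclusive; `KNOWN_GAPS` and `in_known_gap` are hypothetical names, not part of the codebase.

```python
from datetime import date

# Gaps as stated in the corpus description; both bounds are inclusive.
KNOWN_GAPS = [
    (date(1785, 3, 1), date(1785, 3, 31)),    # issues missing from the publisher's archive
    (date(1979, 1, 1), date(1979, 10, 31)),   # no editions published during the strike
]


def in_known_gap(day: date) -> bool:
    """Return True if `day` falls inside one of the documented gaps."""
    return any(start <= day <= end for start, end in KNOWN_GAPS)
```

A caller could use `in_known_gap(date(1979, 5, 1))` to tell a documented gap apart from a genuinely empty result.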
Binary file modified backend/corpora/times/images/times.jpg
Binary file added backend/corpora/times/images/times.jpg~
Binary file not shown.
Binary file removed backend/corpora/times/images/times_thumb.jpg
Binary file not shown.
2 changes: 1 addition & 1 deletion backend/corpora/times/times.py
@@ -35,7 +35,7 @@ class Times(XMLCorpusDefinition):
     scan_image_type = getattr(settings, 'TIMES_SCAN_IMAGE_TYPE', 'image/png')
     description_page = 'times.md'
     languages = ['en']
-    category = 'newspaper'
+    category = 'periodical'

     @property
     def es_settings(self):