Example collection #13

ruebot · 2019-03-03T01:15:47Z

Do we want to use this one? If so, we should probably cite it in the notebook. We normally do Canadian Political Parties and Interest Groups, but those are some big derivatives.

- Resolve #14 - Resolve #13 - Update notebooks to use NLTK stopwords - Add NLTK stopwords

ianmilligan1 · 2019-03-03T21:04:33Z

Hmm. Let's think a bit more on this. Agreed that CPP is a bit too big. The current example data we're using isn't ideal. I think we'd like a small-ish collection with:

multiple years;
multiple domains;
and in a dream world, multiple languages.

I think Victoria might have some ideal candidate collections. I can try to find a cycle to dig through some of the Archive-It pages, but @SamFritz if you have a moment do you want to take a look around the UVic archive-it pages and see if there are any ones that fit that criteria?

ianmilligan1 · 2019-03-03T21:07:59Z

Actually, do any of these collections have manageable derivative sizes? (I don't have a UVic collection synced in the Cloud right)

If any of those stand out, I can write to UVic to see if they are interested in being used as "sample data."

ruebot · 2019-03-03T21:26:07Z

The Trans Web:

Gephi: 4.46MB
Raw Network: 1.67MB
Domains: 10.4KB
Full Text: 1.81GB
Text by Domains: needs to be run

British Columbia Local Governments:

Gephi: 9.11MB
Raw Network: 4.38MB
Domains: 71.3KB
Full Text: 30.7GB
Text by Domains: needs to be run

B.C. Teachers' Labour Dispute (2014):

Gephi: 2MB
Raw Network: 751KB
Domains: 19.9KB
Full Text: 367MB
Text by Domains: needs to be run

Trans Web:

Gephi: 5.16MB
Raw Network: 1.81MB
Domains: 9.91KB
Full Text: 1.49GB
Text by Domains: 79.2MB

ianmilligan1 · 2019-03-03T21:35:07Z

OK great, thanks @ruebot. I like BC Teachers Labour Dispute: neat topic, has mostly content from 2014 but also from 2015, fair number of domains, and different domains that take very divergent perspectives on the issue. Plus it's about the size that we could bundle with the image, knock on wood.

@greebie @ruebot @SamFritz provide any thoughts you might have on using this as a sample datasets.. if I get thumbs up, I'd like to reach out to UVic.

ruebot · 2019-03-03T22:43:17Z

Once we're in agreement, I'll create a branch for it.

greebie · 2019-03-03T23:28:39Z

I have the UVIc account logged into my cloud account. I think the Teachers labor dispute has legs. I like the Transweb one, but I don't think it has much in terms of years available yet.

ruebot · 2019-03-04T03:01:54Z

Should we have a section in the README like we do in docker-auk once we figure out which collection to use?

- Resolve #14 - Partially address #13 - Resolve #17 - Update notebooks to use NLTK stopwords - Add NLTK stopwords

SamFritz · 2019-03-04T19:59:27Z

agreed, I think the BC Teachers Labour Dispute collection would work well, as a second runner I probably select the Trans Web collection (text wise it's a bit larger).

ianmilligan1 · 2019-03-04T20:10:20Z

Perfect, thanks all. I'll send them an e-mail to see if there's interest.

ruebot · 2019-03-04T20:49:00Z

The next spark job in the queue is for the BC Teachers collections. Should be done later tonight, or early tomorrow. I'll create a branch, and we'll see if it works. I think we'll be fine with the GitHub size limits.

ruebot · 2019-03-05T13:18:39Z

I have the data ready. We need to work through #21 and #22 before I can move forward with this. Both are fairly straightforward, so hopefully we can get to this one by the end of the day worst case.

ruebot · 2019-03-05T20:38:10Z

Back to the drawing board. We need a collection where all the derivatives are under a 100MB.

$ git push origin issue-13
Counting objects: 10, done.
Delta compression using up to 12 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (10/10), 72.35 MiB | 1000.00 KiB/s, done.
Total 10 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.
remote: error: Trace: 11c76664f8efdae8bf95f093f60b634e
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File data/4867-fulltext.txt is 360.63 MB; this exceeds GitHub's file size limit of 100.00 MB
To github.com:archivesunleashed/auk-notebooks.git
 ! [remote rejected] issue-13 -> issue-13 (pre-receive hook declined)
error: failed to push some refs to '[email protected]:archivesunleashed/auk-notebooks.git'

ruebot · 2019-03-05T20:47:18Z

Snowden Archive

First Nations and Indigenous Community websites

Nova Scotia FOIPOP portal breach

Legalization of cannabis in Nova Scotia

NGOs

greebie · 2019-03-05T20:57:20Z

Maybe we could just truncate the text? The script will only read the first 2500 lines anyway.

ianmilligan1 · 2019-03-05T22:07:35Z

Yeah, I think truncating the text would work here. Trim the text to 35MB or so and just make clear that it’s a sample in the README?

ruebot · 2019-03-05T22:37:46Z

Cool. 43k lines of text from the file is: 99M. That should do it. I'll test in a moment.

greebie · 2019-03-05T22:38:48Z

Ooops. Sorry - I had a comment and then closed the issue instead of deleting it.

- Remove existing dataset - Add 4867 data - Update example notebook - Update README

ruebot added a commit that referenced this issue Mar 3, 2019

Use NLTK stopwords, cite example dataset.

ef3a65a

- Resolve #14 - Resolve #13 - Update notebooks to use NLTK stopwords - Add NLTK stopwords

ruebot self-assigned this Mar 3, 2019

ruebot mentioned this issue Mar 3, 2019

Use NLTK stopwords, cite example dataset. #15

Merged

ruebot added enhancement New feature or request question Further information is requested labels Mar 4, 2019

ianmilligan1 pushed a commit that referenced this issue Mar 4, 2019

Use NLTK stopwords, update README (#15)

d1088fa

- Resolve #14 - Partially address #13 - Resolve #17 - Update notebooks to use NLTK stopwords - Add NLTK stopwords

greebie closed this as completed Mar 5, 2019

greebie reopened this Mar 5, 2019

ruebot added a commit that referenced this issue Mar 5, 2019

Add B.C. Teachers' Labour Dispute (2014); resolves #13.

2587154

- Remove existing dataset - Add 4867 data - Update example notebook - Update README

ruebot added a commit that referenced this issue Mar 5, 2019

Add B.C. Teachers' Labour Dispute (2014); resolves #13.

0a89abd

- Remove existing dataset - Add 4867 data - Update example notebook - Update README

ianmilligan1 closed this as completed in 7e20729 Mar 6, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Example collection #13

Example collection #13

ruebot commented Mar 3, 2019

ianmilligan1 commented Mar 3, 2019

ianmilligan1 commented Mar 3, 2019

ruebot commented Mar 3, 2019

ianmilligan1 commented Mar 3, 2019

ruebot commented Mar 3, 2019

greebie commented Mar 3, 2019

ruebot commented Mar 4, 2019

SamFritz commented Mar 4, 2019

ianmilligan1 commented Mar 4, 2019

ruebot commented Mar 4, 2019

ruebot commented Mar 5, 2019

ruebot commented Mar 5, 2019

ruebot commented Mar 5, 2019

greebie commented Mar 5, 2019

ianmilligan1 commented Mar 5, 2019

ruebot commented Mar 5, 2019

greebie commented Mar 5, 2019

Example collection #13

Example collection #13

Comments

ruebot commented Mar 3, 2019

ianmilligan1 commented Mar 3, 2019

ianmilligan1 commented Mar 3, 2019

ruebot commented Mar 3, 2019

ianmilligan1 commented Mar 3, 2019

ruebot commented Mar 3, 2019

greebie commented Mar 3, 2019

ruebot commented Mar 4, 2019

SamFritz commented Mar 4, 2019

ianmilligan1 commented Mar 4, 2019

ruebot commented Mar 4, 2019

ruebot commented Mar 5, 2019

ruebot commented Mar 5, 2019

ruebot commented Mar 5, 2019

Snowden Archive

First Nations and Indigenous Community websites

Nova Scotia FOIPOP portal breach

Legalization of cannabis in Nova Scotia

NGOs

greebie commented Mar 5, 2019

ianmilligan1 commented Mar 5, 2019

ruebot commented Mar 5, 2019

greebie commented Mar 5, 2019