m4-1.md

Visualizing your Data

Start digging in the data mines for this module here

'Shaft', by Kačka a Ondra, Flickr

<iframe width="420" height="315" src="https://www.youtube.com/embed/jIfu2A0ezq0" frameborder="0" allowfullscreen></iframe>

Qualitative vs. Quantitative Data

  • how do you see patterns in words?

You count 'em.

  • or, you realize that you can count the patterns that clusters of words make up
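Counting words can be sketched in a few lines. This is a minimal illustration, not how any particular tool does it: the `word_counts` helper and the sample sentence are assumptions made for the example, and the tokenizer simply keeps runs of letters and apostrophes.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text, pull out runs of letters/apostrophes, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Sample sentence chosen for illustration only
sample = "The cat sat on the mat. Then the cat chased the rat."
counts = word_counts(sample)

# Show the most frequent words first
for word, n in counts.most_common(3):
    print(word, n)
```

Even on one sentence the pattern is visible: the most frequent word ("the") tells you almost nothing, which is why later steps strip such words out.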

image - sheet of paper where the whitespace makes images

http://biodiversitylibrary.org/page/37047310#page/496/mode/1up

Many tools available

  • and many approaches
  • traditionally, easier to get money to develop a new tool than to do research using that tool

See Bamboo DiRT

http://dirtdirectory.org/

Voyant

  • voyant-tools.org

image

So let's see what word counts can get us

Voyant contains many tools

Overview Project

image

  • overviewproject.org
  • Look at distributions of words over a corpus
  • TF-IDF

How it works

  • The cat sat on the mat. Then the cat chased the rat.
  • The cat slept all day on the mat.
  • The rat ran across the floor.

Strip the stopwords

  • cat sat mat cat chased rat
  • cat slept all day mat
  • rat ran across floor
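The stopword step can be sketched as follows. The stopword list here is an assumption: it is the smallest list that reproduces the three stripped lines above (note that "all" and "across" survive). Real tools ship much longer lists.

```python
import re

# Assumption: a tiny stopword list picked to match the example above
STOPWORDS = {"the", "on", "then"}

DOCS = [
    "The cat sat on the mat. Then the cat chased the rat.",
    "The cat slept all day on the mat.",
    "The rat ran across the floor.",
]


def strip_stopwords(text, stopwords=STOPWORDS):
    """Lowercase, split into alphabetic tokens, and drop any stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]


cleaned = [strip_stopwords(doc) for doc in DOCS]
for tokens in cleaned:
    print(" ".join(tokens))
```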

Count what remains

image
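Counting what remains gives one row of counts per word and one column per document. A minimal sketch, starting from the three stripped documents; the table layout is just one way to print it.

```python
from collections import Counter

# The three documents after stopwords have been stripped
CLEANED = [
    "cat sat mat cat chased rat",
    "cat slept all day mat",
    "rat ran across floor",
]

# One Counter per document: word -> number of occurrences in that document
term_counts = [Counter(doc.split()) for doc in CLEANED]

# Every distinct word seen anywhere in the corpus, in alphabetical order
vocabulary = sorted(set().union(*term_counts))

# Print a small term-by-document table (a missing word counts as 0)
header = "term".ljust(8) + "".join(f"doc{i + 1}".rjust(6) for i in range(len(CLEANED)))
print(header)
for term in vocabulary:
    row = "".join(str(c[term]).rjust(6) for c in term_counts)
    print(term.ljust(8) + row)
```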

TF-IDF

image to that re cats

  • compare every pair of documents
  • multiply the frequencies of corresponding words
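One common way to turn the two bullets above into numbers is TF-IDF weighting followed by cosine similarity. The sketch below assumes the plain formula tf × log(N / df) and normalizes the summed products by vector length; the source does not say which exact weighting Overview uses, so treat these choices as assumptions.

```python
import math
from collections import Counter
from itertools import combinations

# The three documents after stopwords have been stripped
CLEANED = [
    "cat sat mat cat chased rat",
    "cat slept all day mat",
    "rat ran across floor",
]


def tfidf_vectors(docs):
    """Weight each word by its count in the document times log(N / docs containing it)."""
    counts = [Counter(doc.split()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(term for c in counts for term in c)
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in c.items()} for c in counts]


def cosine(a, b):
    """Multiply the weights of shared words, sum them, and normalize by vector lengths."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


vectors = tfidf_vectors(CLEANED)
for i, j in combinations(range(len(vectors)), 2):
    print(f"doc{i + 1} vs doc{j + 1}: {cosine(vectors[i], vectors[j]):.3f}")
```

Documents that share no words score 0; documents that share rarer words score higher than documents that share common ones.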

Visualize

image

Tag, Visualize

image

Topic Modeling

  • many approaches go under the term 'topic modeling'
  • most common in humanities approaches: LDA

Topic Modeling Gettysburg Address

image to text

+the text

War v Governance

  • What % of this text is made up of a 'war' topic?
  • How do you know what 'war' words are?
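The naive answer is to pick the 'war' words yourself and count them. The sketch below does that on two sentences from the Gettysburg Address; the `WAR_WORDS` list is a hand-picked assumption, which is exactly the problem the second bullet points at.

```python
import re

# Two sentences from the Gettysburg Address
EXCERPT = (
    "Now we are engaged in a great civil war, testing whether that nation, "
    "or any nation so conceived and so dedicated, can long endure. "
    "We are met on a great battle-field of that war."
)

# Assumption: a hand-picked list of 'war' words, chosen for this example
WAR_WORDS = {"war", "battle", "field"}

tokens = re.findall(r"[a-z]+", EXCERPT.lower())
hits = [t for t in tokens if t in WAR_WORDS]
share = len(hits) / len(tokens)

print(f"{len(hits)} of {len(tokens)} words ({share:.0%}) are on the 'war' list")
```

Change the list and the percentage changes with it. Topic modeling tries to let the word groupings emerge from the texts instead.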

Teaching the computer

  • supervised vs. unsupervised

How the world works

  • we all just pull from bags of words, right?
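The 'bags of words' story can be run forwards as a toy generator: pick a bag according to the document's mix of topics, then pull a word from that bag. The topics, weights, and the `write_document` helper below are made up for illustration; LDA proper also draws the mixes themselves from prior distributions, which this sketch leaves out.

```python
import random

# Hypothetical topics: each is a bag of words with made-up weights
TOPICS = {
    "war": {"war": 4, "battle": 3, "field": 2, "dead": 2},
    "governance": {"nation": 4, "people": 3, "government": 3, "liberty": 2},
}


def write_document(topic_mix, n_words, rng):
    """Generate words by repeatedly choosing a topic, then a word from its bag."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights)[0]
        bag = TOPICS[topic]
        words.append(rng.choices(list(bag), weights=list(bag.values()))[0])
    return words


rng = random.Random(42)  # fixed seed so the run is repeatable
doc = write_document({"war": 0.7, "governance": 0.3}, 12, rng)
print(" ".join(doc))
```

Topic modeling runs this story backwards: given only the documents, it estimates which bags and which mixes would most plausibly have produced them.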

image bags of words

State library of Australia

Topic Model GUI

interface window

Typical output

typical output

Other ways of representing the connection between topics and documents?

Imgur

  • Heatmap
  • Network
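Both views start from the same table of document-topic proportions. The sketch below uses made-up proportions (not output from any real model) to print a crude text heatmap and to build a weighted edge list; the `threshold` value and the shading characters are arbitrary choices.

```python
# Made-up document-topic proportions, for illustration only
DOC_TOPICS = {
    "doc1": [0.70, 0.20, 0.10],
    "doc2": [0.10, 0.80, 0.10],
    "doc3": [0.45, 0.05, 0.50],
}

SHADES = " .:*#"  # light to dark


def heatmap_row(weights):
    """Map each proportion (0..1) to one shading character."""
    last = len(SHADES) - 1
    return "".join(SHADES[min(int(w * len(SHADES)), last)] for w in weights)


def edges(doc_topics, threshold=0.25):
    """Keep a document-topic link only when the proportion reaches the threshold."""
    return [
        (doc, f"topic{i}", weight)
        for doc, weights in doc_topics.items()
        for i, weight in enumerate(weights)
        if weight >= threshold
    ]


# Heatmap: one row per document, one column per topic
for doc, weights in DOC_TOPICS.items():
    print(doc, "|" + heatmap_row(weights) + "|")

# Network: a weighted edge list in CSV form
edge_list = edges(DOC_TOPICS)
print("Source,Target,Weight")
for source, target, weight in edge_list:
    print(f"{source},{target},{weight}")
```

An edge list with Source, Target, and Weight columns is a common input format for network tools, so the same table can feed either kind of picture.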

MALLET in R

image

Next Day

  • AntConc
  • NER
  • SNA with Gephi