---
title: "Given names"
author: "Florian Gaudin-Delrieu"
date: "21 March 2016"
output: html_document
---
## About
The idea is to study the data about given names from [wikipedia.org](http://wikipedia.org), and in particular the occupations of people who share a given name.
Scraping Wikipedia for first names and occupations would be difficult, because the data is unstructured. Fortunately, the [wikidata project](http://wikidata.org) exists. It gives access to structured data, so we will use it to get the information we want.
```{r Needed libraries, message=FALSE}
library(ggplot2)
library(WikidataR)
library(dplyr)
library(tidyr)
library(magrittr)
library(ggthemes)
library(plotly)
library(SPARQL)
library(wordcloud)
library(stringr)
theme_set(theme_minimal(12))
```
## Retrieving information
First we get the item id of the given name we want to study (it starts with a Q and has no meaning in itself). Then we feed that id into a SPARQL query that returns the dataset.
### Getting the item id
We want to get information about one first name, let's say "Florian". We will use the `find_item` function from the WikidataR library.
```{r Getting the list}
liste <- find_item("Florian") #Limit is set by default to 10 results, it should be enough here.
liste
```
Now we have a list of up to 10 results. We want the id of the first entry whose description contains "given name", falling back to the very first entry if there is none. I wrapped these steps into a function.
```{r Getting the id}
selectionnerID<-function(nom){
  liste <- find_item(nom)
  id <-""
  for(i in seq_along(liste)){
    # Skip entries that have no description
    if(is.null(liste[[i]]$description)){
      next
    }
    if(grepl("given name",liste[[i]]$description)){
      id <- liste[[i]]$id
      break # we break on the first given name we find
    }
  }
  # If we haven't found a given name, then we take the first id
  if(id==""){
    id<-liste[[1]]$id
  }
  return(id)
}
id<-selectionnerID("Florian")
get_item(id) #Sanity check
```
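The selection rule can be checked offline on a hand-made list. The sketch below is only an illustration: `faux_resultats` and its ids are made up, shaped on the assumption that each search result is a list with an `id` and an optional `description`, and `premier_prenom` is a hypothetical helper that applies the same rule without calling Wikidata.

```r
# Hypothetical search results (the ids are placeholders, not real items)
faux_resultats <- list(
  list(id = "Q0000001", description = "commune in France"),
  list(id = "Q0000002"),                        # entry with no description
  list(id = "Q0000003", description = "male given name")
)

# Return the id of the first entry described as a given name,
# or the id of the first entry when none matches
premier_prenom <- function(liste) {
  for (entree in liste) {
    if (!is.null(entree$description) &&
        grepl("given name", entree$description)) {
      return(entree$id)
    }
  }
  liste[[1]]$id
}

premier_prenom(faux_resultats)
```

Here the third entry is selected, because it is the first one whose description mentions "given name".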
### Modifying the query
We will query Wikidata with a query I wrote, substituting in the given name id we just got. The placeholder **REPLACE_ID** marks where the id goes, so that is the string we will look for and replace.
The parameters are:
* __P735__ for given name
* __P106__ for occupation
* __P569__ for the date of birth
* __P27__ for the country of citizenship
* Labels are the French ones, with a fallback to English.
You can find a property code using `find_property("parameter")` from WikidataR.
```{r}
endpoint <- "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
#endpoint <- "https://query.wikidata.org/sparql"
prefix<-c("wd","<http://www.wikidata.org/entity/>",
"wdt", "<http://www.wikidata.org/prop/direct/>",
"wikibase","<http://wikiba.se/ontology#>")
generic_query <-"SELECT ?item ?itemLabel ?occupationLabel ?paysLabel ?annee
WHERE
{
?item wdt:P735 wd:REPLACE_ID .
?item wdt:P106 ?occupation .
OPTIONAL {?item wdt:P569 ?anneeN} .
OPTIONAL {?item wdt:P27 ?pays} .
BIND(YEAR(?anneeN) as ?annee) .
SERVICE wikibase:label { bd:serviceParam wikibase:language \"fr,en\" }
}
ORDER BY DESC (?annee)"
query<-sub("REPLACE_ID",id,generic_query)
```
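The substitution step is easy to check in isolation. A minimal sketch, assuming a made-up id (`Q0000000` is a placeholder, not a real item) and a one-line template instead of the full query:

```r
# One line of the query template, with the placeholder to replace
modele <- "?item wdt:P735 wd:REPLACE_ID ."

# fixed = TRUE treats the placeholder as a literal string, not a regex
requete <- sub("REPLACE_ID", "Q0000000", modele, fixed = TRUE)
requete

# The placeholder should no longer appear in the final query
grepl("REPLACE_ID", requete, fixed = TRUE)
```

`sub` replaces only the first occurrence, which is enough here because the placeholder appears once in the query.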
### Executing the request
We will use the SPARQL library to get the dataset; the call goes through `mySPARQL`, a function sourced from the local file `mySPARQL.R`.
```{r Query}
source("mySPARQL.R")
res <- mySPARQL(endpoint,query,ns=prefix,format = "xml")
resultats<- res$results
summary(resultats)
glimpse(resultats)
```
### Cleaning the data
We need to format the data we got. Each label column comes back in the form `"GivenName LastName"@fr`, and the selectionnerNom function keeps only the middle part, between the quotes. We apply it to the label columns (itemLabel, occupationLabel and paysLabel) and convert those columns to factors.
```{r Cleaning the data}
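selectionnerNom<-function(x,colonne,nom) {
  # Split the label on double quotes and keep only the middle piece
  a<-x %>%
    separate(all_of(colonne),into=c("debut",nom,"fin"),sep="\"",remove=TRUE) %>%
    select(-debut,-fin)
  # [[ ]] works for both data frames and tibbles
  a[[nom]]<-factor(a[[nom]])
  return(a)
}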
resultats<-selectionnerNom(resultats,"itemLabel","nom")
resultats<-selectionnerNom(resultats,"occupationLabel","metier")
resultats<-selectionnerNom(resultats,"paysLabel","pays")
resultats<-as_tibble(resultats)
resultats
```
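The same extraction can be sketched with base R only, using a regular expression rather than splitting on quotes. This is an alternative illustration, not the method used in this document; `nettoyer` and the sample labels are made up for the example.

```r
# Hypothetical labels in the form returned by the query
etiquettes <- c('"Florian Exemple"@fr', '"writer"@en')

# Keep the text between the quotes and drop the language tag
nettoyer <- function(x) {
  sub('^"(.*)"@[A-Za-z-]+$', "\\1", x)
}

nettoyer(etiquettes)
```

Strings that do not match the pattern are returned unchanged, so a label without quotes or a language tag passes through as-is.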
```{r Word Cloud}
library(wordcloud)
tousLesMetiers<-resultats %>%
group_by(metier) %>%
count(metier)
set.seed(1)
wordcloud(tousLesMetiers$metier,tousLesMetiers$n,scale=c(3,0.25),min.freq=2,colors=brewer.pal(9,"OrRd")[c(5,6,7,8,9)],random.order = FALSE,rot.per=0.3)
```
Some descriptions are quite long, and sometimes the most frequent occupation does not fit on the word cloud. I wrote a function that replaces a space or a dash near the middle of the description with a newline character, when there is one.
```{r Cutting words for better wordcloud printing}
couperMot<-function(mot){
  # Cuts a string at the first space or dash after its middle
  z=as.character(mot)
  mi=ceiling(str_length(z)/2)
  a<-str_locate_all(z,"[ -]")[[1]]
  if (length(a)==0){ # No space or dash: return the string unchanged
    return(z)
  }
  else { # Take the first space or dash after the middle
    b<-a[(a-mi>0)[,1],1][1]
    if (is.na(b)){ # All spaces or dashes are before the middle (b is NA, not empty)
      b<-a[1,1]
    }
    str_sub(z,b,b)<-"\n"
    return(z)
  }
}
tousLesMetiers$new<-mapply(couperMot,tousLesMetiers$metier)
wordcloud(tousLesMetiers$new,tousLesMetiers$n,scale=c(3,0.25),min.freq=2,colors=brewer.pal(9,"OrRd")[c(5,6,7,8,9)],random.order = FALSE,rot.per=0.3)
```
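To see the cutting rule on concrete strings, here is a base R sketch of the same idea. It is an illustration under assumptions, not a replacement for the function above: `couper_base` is a hypothetical name, and it uses `gregexpr` and `substr` instead of stringr.

```r
couper_base <- function(mot) {
  z <- as.character(mot)
  milieu <- ceiling(nchar(z) / 2)
  # Positions of spaces and dashes; gregexpr returns -1 when there is none
  positions <- gregexpr("[ -]", z)[[1]]
  if (positions[1] == -1) {
    return(z)
  }
  # First separator after the middle, else the first separator overall
  apres <- positions[positions > milieu]
  coupe <- if (length(apres) > 0) apres[1] else positions[1]
  substr(z, coupe, coupe) <- "\n"
  z
}

couper_base("association football player")
couper_base("film director")
couper_base("writer")
```

In "association football player" the middle is at character 14, so the cut lands on the space before "player". In "film director" the only space is before the middle, so the fallback uses it. "writer" has no separator and is returned unchanged.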
```{r Wrapping it all together}
gettingData<-function(prenom){
id<-selectionnerID(prenom)
#Building the query
endpoint <- "https://query.wikidata.org/bigdata/namespace/wdq/sparql"
prefix<-c("wd","<http://www.wikidata.org/entity/>",
"wdt", "<http://www.wikidata.org/prop/direct/>",
"wikibase","<http://wikiba.se/ontology#>")
generic_query <-"SELECT ?item ?itemLabel ?occupationLabel ?paysLabel ?annee
WHERE
{
?item wdt:P735 wd:REPLACE_ID .
?item wdt:P106 ?occupation .
OPTIONAL {?item wdt:P569 ?anneeN} .
OPTIONAL {?item wdt:P27 ?pays} .
BIND(YEAR(?anneeN) as ?annee) .
SERVICE wikibase:label { bd:serviceParam wikibase:language \"fr,en\" }
}
ORDER BY DESC (?annee)"
query<-sub("REPLACE_ID",id,generic_query)
#Getting the data
results<-mySPARQL(endpoint,query,ns=prefix,format = "xml")$results
#Cleaning the data
results<-selectionnerNom(results,"itemLabel","nom")
results<-selectionnerNom(results,"occupationLabel","metier")
results<-selectionnerNom(results,"paysLabel","pays")
return(results)
}
occupationWordcloud<-function(results){
everyOccupation<-results %>%
group_by(metier) %>%
count(metier)
everyOccupation$occ<-mapply(couperMot,everyOccupation$metier)
set.seed(14)
wordcloud(everyOccupation$occ,everyOccupation$n,
scale=c(3,0.25),min.freq=2,
colors=brewer.pal(6,"Dark2"),
random.order = FALSE,rot.per=0.3)
}
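occupationGraph<-function(results){
  # The first word of the first label is taken as the given name
  prenom<-word(as.character(results$nom[1]))
  # Keep one row per person; .keep_all retains annee and pays for the plot
  ggplotly(ggplot(results %>% distinct(item,.keep_all=TRUE),aes(x=annee))+
    geom_histogram(binwidth = 10,aes(fill=pays))+
    ggtitle(paste("Distribution of people named", prenom,"in Wikidata"))+
    ylab("Count")+
    xlab("Year of birth"))
}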
```
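The query returns one row per person and occupation, so someone with several occupations appears several times, and the histogram should count each person once. A minimal base R sketch of that deduplication, on made-up rows (the items "A" and "B" and their values are hypothetical):

```r
# Hypothetical result rows: item "A" has two occupations
lignes <- data.frame(
  item   = c("A", "A", "B"),
  metier = c("writer", "painter", "writer"),
  annee  = c(1950, 1950, 1982)
)

# Keep the first row for each item, with all its columns
personnes <- lignes[!duplicated(lignes$item), ]
nrow(personnes)
personnes$annee
```

Only the first occupation of each person survives this step, which is fine for a plot of birth years but would lose information in an occupation count.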
```{r Examples}
florian<-gettingData("Florian")
occupationWordcloud(florian)
occupationGraph(florian)
```