Commit 1e866e0

Merge cleanup porting DataTables#66 over.
1 parent 4c33c8b commit 1e866e0

55 files changed: +1379 -5843 lines changed

REQUIRE (-2)
@@ -4,7 +4,5 @@ CategoricalArrays 0.2.0
 StatsBase 0.11.0
 SortingAlgorithms
 Reexport
-Compat 0.19.0
 WeakRefStrings 0.3.0
 DataStreams 0.2.0
-CSV 0.2.0

docs/src/man/categorical.md (+2 -2)
@@ -45,9 +45,9 @@ cv = categorical(v)
 Or you can edit the columns of a `DataFrame` in-place using the `categorical!` function:

 ```julia
-dt = DataFrame(A = [1, 1, 1, 2, 2, 2],
+df = DataFrame(A = [1, 1, 1, 2, 2, 2],
                B = ["X", "X", "X", "Y", "Y", "Y"])
-categorical!(dt, [:A, :B])
+categorical!(df, [:A, :B])
 ```

 Using categorical arrays is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl). When fitting regression models, `CategoricalArray` columns in the input are translated into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of the `CategoricalArray`. This allows one to analyze categorical data efficiently.

docs/src/man/getting_started.md (+19 -19)
@@ -107,59 +107,59 @@ julia> nulls(Int, 1, 3)
 The `DataFrame` type can be used to represent data tables, each column of which is a vector. You can specify the columns using keyword arguments:

 ```julia
-dt = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
+df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
 ```

 It is also possible to construct a `DataFrame` in stages:

 ```julia
-dt = DataFrame()
-dt[:A] = 1:8
-dt[:B] = ["M", "F", "F", "M", "F", "M", "M", "F"]
-dt
+df = DataFrame()
+df[:A] = 1:8
+df[:B] = ["M", "F", "F", "M", "F", "M", "M", "F"]
+df
 ```

 The `DataFrame` we build in this way has 8 rows and 2 columns. You can check this using `size` function:

 ```julia
-nrows = size(dt, 1)
-ncols = size(dt, 2)
+nrows = size(df, 1)
+ncols = size(df, 2)
 ```

 We can also look at small subsets of the data in a couple of different ways:

 ```julia
-head(dt)
-tail(dt)
+head(df)
+tail(df)

-dt[1:3, :]
+df[1:3, :]
 ```

 Having seen what some of the rows look like, we can try to summarize the entire data set using `describe`:

 ```julia
-describe(dt)
+describe(df)
 ```

 To focus our search, we start looking at just the means and medians of specific columns. In the example below, we use numeric indexing to access the columns of the `DataFrame`:

 ```julia
-mean(Nulls.skip(dt[1]))
-median(Nulls.skip(dt[1]))
+mean(Nulls.skip(df[1]))
+median(Nulls.skip(df[1]))
 ```

 We could also have used column names to access individual columns:

 ```julia
-mean(Nulls.skip(dt[:A]))
-median(Nulls.skip(dt[:A]))
+mean(Nulls.skip(df[:A]))
+median(Nulls.skip(df[:A]))
 ```

 We can also apply a function to each column of a `DataFrame` with the `colwise` function. For example:

 ```julia
-dt = DataFrame(A = 1:4, B = randn(4))
-colwise(c->cumsum(Nulls.skip(c)), dt)
+df = DataFrame(A = 1:4, B = randn(4))
+colwise(c->cumsum(Nulls.skip(c)), df)
 ```

 ## Importing and Exporting Data (I/O)
@@ -191,8 +191,8 @@ a `DataFrame` rather than the default `DataFrame`. Keyword arguments may be pass

 A DataFrame can be written to a CSV file at path `output` using
 ```julia
-dt = DataFrame(x = 1, y = 2)
-CSV.write(output, dt)
+df = DataFrame(x = 1, y = 2)
+CSV.write(output, df)
 ```

 For more information, use the REPL [help-mode](http://docs.julialang.org/en/stable/manual/interacting-with-julia/#help-mode) or checkout the online [CSV.jl documentation](https://juliadata.github.io/CSV.jl/stable/)!

docs/src/man/joins.md (-20)
@@ -35,13 +35,8 @@ There are seven kinds of joins supported by the DataFrames package:
 You can control the kind of join that `join` performs using the `kind` keyword argument:

 ```julia
-<<<<<<< HEAD
 a = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
 b = DataFrame(ID = [20, 60], Job = ["Lawyer", "Astronaut"])
-=======
-a = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
-b = DataFrame(ID = [20, 60], Job = ["Lawyer", "Astronaut"])
->>>>>>> b196630fa9ba02372a25dec222425d9b804f5fd5
 join(a, b, on = :ID, kind = :inner)
 join(a, b, on = :ID, kind = :left)
 join(a, b, on = :ID, kind = :right)
@@ -56,37 +51,22 @@ Cross joins are the only kind of join that does not use a key:
 join(a, b, kind = :cross)
 ```

-<<<<<<< HEAD
-In order to join data frames on keys which have different names, you must first rename them so that they match. This can be done using rename!:
-
-```julia
-a = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
-b = DataFrame(IDNew = [20, 40], Job = ["Lawyer", "Doctor"])
-=======
 In order to join data tables on keys which have different names, you must first rename them so that they match. This can be done using rename!:

 ```julia
 a = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
 b = DataFrame(IDNew = [20, 40], Job = ["Lawyer", "Doctor"])
->>>>>>> b196630fa9ba02372a25dec222425d9b804f5fd5
 rename!(b, :IDNew, :ID)
 join(a, b, on = :ID, kind = :inner)
 ```

 Or renaming multiple columns at a time:

 ```julia
-<<<<<<< HEAD
-a = DataFrame(City = ["Amsterdam", "London", "London", "New York", "New York"],
-              Job = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
-              Category = [1, 2, 3, 4, 5])
-b = DataFrame(Location = ["Amsterdam", "London", "London", "New York", "New York"],
-=======
 a = DataFrame(City = ["Amsterdam", "London", "London", "New York", "New York"],
               Job = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
               Category = [1, 2, 3, 4, 5])
 b = DataFrame(Location = ["Amsterdam", "London", "London", "New York", "New York"],
->>>>>>> b196630fa9ba02372a25dec222425d9b804f5fd5
               Work = ["Lawyer", "Lawyer", "Lawyer", "Doctor", "Doctor"],
               Name = ["a", "b", "c", "d", "e"])
 rename!(b, [:Location => :City, :Work => :Job])

docs/src/man/reshaping_and_pivoting.md (+7 -7)
@@ -43,20 +43,20 @@ d = stack(iris)
 `unstack` converts from a long format to a wide format. The default is requires specifying which columns are an id variable, column variable names, and column values:

 ```julia
-longdt = melt(iris, [:Species, :id])
-widedt = unstack(longdt, :id, :variable, :value)
+longdf = melt(iris, [:Species, :id])
+widedf = unstack(longdf, :id, :variable, :value)
 ```

 If the remaining columns are unique, you can skip the id variable and use:

 ```julia
-widedt = unstack(longdt, :variable, :value)
+widedf = unstack(longdf, :variable, :value)
 ```

-`stackdt` and `meltdt` are two additional functions that work like `stack` and `melt`, but they provide a view into the original wide DataFrame. Here is an example:
+`stackdf` and `meltdf` are two additional functions that work like `stack` and `melt`, but they provide a view into the original wide DataFrame. Here is an example:

 ```julia
-d = stackdt(iris)
+d = stackdf(iris)
 ```

 This saves memory. To create the view, several AbstractVectors are defined:
@@ -73,13 +73,13 @@ This repeats the original columns N times where N is the number of columns stack
 For more details on the storage representation, see:

 ```julia
-dump(stackdt(iris))
+dump(stackdf(iris))
 ```

 None of these reshaping functions perform any aggregation. To do aggregation, use the split-apply-combine functions in combination with reshaping. Here is an example:

 ```julia
 d = stack(iris)
-x = by(d, [:variable, :Species], dt -> DataFrame(vsum = mean(Nulls.skip(dt[:value]))))
+x = by(d, [:variable, :Species], df -> DataFrame(vsum = mean(Nulls.skip(df[:value]))))
 unstack(x, :Species, :vsum)
 ```

docs/src/man/split_apply_combine.md (+6 -6)
@@ -12,15 +12,15 @@ using CSV
 iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"), DataFrame)

 by(iris, :Species, size)
-by(iris, :Species, dt -> mean(Nulls.skip(dt[:PetalLength])))
-by(iris, :Species, dt -> DataFrame(N = size(dt, 1)))
+by(iris, :Species, df -> mean(Nulls.skip(df[:PetalLength])))
+by(iris, :Species, df -> DataFrame(N = size(df, 1)))
 ```

 The `by` function also support the `do` block form:

 ```julia
-by(iris, :Species) do dt
-    DataFrame(m = mean(Nulls.skip(dt[:PetalLength])), s² = var(Nulls.skip(dt[:PetalLength])))
+by(iris, :Species) do df
+    DataFrame(m = mean(Nulls.skip(df[:PetalLength])), s² = var(Nulls.skip(df[:PetalLength])))
 end
 ```

@@ -36,7 +36,7 @@ aggregate(iris, :Species, [sum, x->mean(Nulls.skip(x))])
 If you only want to split the data set into subsets, use the `groupby` function:

 ```julia
-for subdt in groupby(iris, :Species)
-    println(size(subdt, 1))
+for subdf in groupby(iris, :Species)
+    println(size(subdf, 1))
 end
 ```

docs/src/man/subsets.md (+6 -6)
@@ -24,7 +24,7 @@ julia> df = DataFrame(A = 1:10, B = 2:2:20)
 Referring to the first column by index or name:

 ```julia
-julia> dt[1]
+julia> df[1]
 10-element Array{Int64,1}:
   1
   2
@@ -37,7 +37,7 @@ julia> dt[1]
   9
  10

-julia> dt[:A]
+julia> df[:A]
 10-element Array{Int64,1}:
   1
   2
@@ -54,25 +54,25 @@ julia> dt[:A]
 Refering to the first element of the first column:

 ```julia
-julia> dt[1, 1]
+julia> df[1, 1]
 1

-julia> dt[1, :A]
+julia> df[1, :A]
 1
 ```

 Selecting a subset of rows by index and an (ordered) subset of columns by name:

 ```julia
-julia> dt[1:3, [:A, :B]]
+julia> df[1:3, [:A, :B]]
 3×2 DataFrames.DataFrame
 │ Row │ A │ B │
 ├─────┼───┼───┤
 │ 1   │ 1 │ 2 │
 │ 2   │ 2 │ 4 │
 │ 3   │ 3 │ 6 │

-julia> dt[1:3, [:B, :A]]
+julia> df[1:3, [:B, :A]]
 3×2 DataFrames.DataFrame
 │ Row │ B │ A │
 ├─────┼───┼───┤

src/DataFrames.jl (+8 -19)
@@ -1,5 +1,4 @@
-__precompile__()
-
+__precompile__(true)
 module DataFrames

 ##############################################################################
@@ -8,12 +7,9 @@ module DataFrames
 ##
 ##############################################################################

-using Reexport
-using StatsBase
-import NullableArrays: dropnull, dropnull!
-@reexport using NullableArrays
-@reexport using CategoricalArrays
-using SortingAlgorithms
+using Reexport, StatsBase, SortingAlgorithms
+@reexport using CategoricalArrays, Nulls
+
 using Base: Sort, Order
 import Base: ==, |>

@@ -23,14 +19,7 @@ import Base: ==, |>
 ##
 ##############################################################################

-export @~,
-       @csv_str,
-       @csv2_str,
-       @formula,
-       @tsv_str,
-       @wsv_str,
-
-       AbstractDataFrame,
+export AbstractDataFrame,
        DataFrame,
        DataFrameRow,
        GroupApplied,
@@ -51,7 +40,6 @@ export @~,
        eachrow,
        eltypes,
        groupby,
-       head,
        melt,
        meltdf,
        names!,
@@ -66,7 +54,6 @@ export @~,
        showcols,
        stack,
        stackdf,
-       tail,
        unique!,
        unstack,
        head,
@@ -83,13 +70,15 @@ export @~,
 ##
 ##############################################################################

+const _displaysize = Base.displaysize
+
 for (dir, filename) in [
     ("other", "utils.jl"),
     ("other", "index.jl"),

     ("abstractdataframe", "abstractdataframe.jl"),
     ("dataframe", "dataframe.jl"),
-    ("subdataframe", "subdataframe.jl"),
+    ("dataframe", "dataframe.jl"),
     ("groupeddataframe", "grouping.jl"),
     ("dataframerow", "dataframerow.jl"),
     ("dataframerow", "utils.jl"),
