Raven Computing edited this page Mar 28, 2021 · 5 revisions

DataFrames in Python

The Pydf library implements the DataFrame specification version 2.0 for Python.

The unified documentation can be accessed here.

Additional Features

This page documents all features of the Python implementation that are not officially defined by the specification.

Using the []-operator

In addition to the standard get and set methods, you can access and change columns, rows and individual DataFrame elements via the []-operator. Its use is optional, but it makes code less verbose.

The []-operator can be used with either one or two selection arguments. Using only one argument will select a specific column, whereas adding a second argument will select a specific row entry inside that column. Columns can always be selected both by index and by name, just like when using the standard API methods.
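
As background, the one- and two-argument forms are ordinary Python subscripting: a single argument reaches `__getitem__` unchanged, while comma-separated arguments arrive as one tuple. The sketch below uses a hypothetical `KeyEcho` class (an illustration only, not part of Pydf) to show what the operator receives for the kinds of keys used on this page.

```python
class KeyEcho:
    """Hypothetical helper, not part of Pydf: returns the key it is subscripted with."""

    def __getitem__(self, key):
        return key


echo = KeyEcho()

one_arg = echo["B"]                # a single argument is passed unchanged
two_args = echo["B", 3]            # two arguments arrive as one tuple
multi = echo[("C", "B"), (4, 2)]   # tuples of columns and rows: a tuple of tuples
all_cols = echo[:, 4]              # ':' becomes slice(None, None, None)

print(one_arg)   # B
print(two_args)  # ('B', 3)
print(multi)     # (('C', 'B'), (4, 2))
print(all_cols)  # (slice(None, None, None), 4)
```

A class that supports this syntax therefore only has to inspect the type and length of the key it is given.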

Accessing Values

You can select specific single values by specifying both the column index/name and the row index. For example, the following code will return the value inside the 'B' column at the row index 3:

print(df)
# _| A     B  C     D
# 0| Bill  34 True  a
# 1| Bob   36 False b
# 2| Mark  25 True  c
# 3| Sofia 31 True  b
# 4| Paul  29 True  a
# 5| Simon 21 False b

val = df["B", 3]

print(val)
# 31

The type of the val variable depends on the type of the selected column. For example, if the 'B' column was an IntColumn then the above operation would return a Python int object, if the 'B' column was a DoubleColumn then a Python float object would be returned, and so on.

When using the []-operator you can also specify column and row indices as negative numbers. The behaviour is similar to that of Python lists. The following example selects the last entry in the 'A' column:

val = df["A", -1]

print(val)
# Simon
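
The list-like behaviour of negative indices can be sketched in plain Python. The `normalize_index` function below is a hypothetical helper written for this illustration; how Pydf resolves negative indices internally is not shown here and is an assumption.

```python
# Plain Python list standing in for the 'A' column of the example DataFrame.
column_a = ["Bill", "Bob", "Mark", "Sofia", "Paul", "Simon"]


def normalize_index(index, length):
    """Map a possibly negative index to its non-negative equivalent."""
    if not -length <= index < length:
        raise IndexError(f"index {index} out of range for length {length}")
    return index + length if index < 0 else index


last = column_a[normalize_index(-1, len(column_a))]

print(normalize_index(-1, len(column_a)))  # 5
print(last)                                # Simon
```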

Multiple values can be selected in various ways. For example, you can select several values inside a specific column:

vals = df["B", (1, 3, 5)]

print(vals)
# _| B
# 0| 36
# 1| 31
# 2| 21

The above code returns a DataFrame with one column (i.e. the 'B' column) and three rows which hold the values at the corresponding indices. Note that the same row can be selected more than once and that the row indices can be given in any order. For example, the following code is valid:

vals = df["A", (5, 5, 2, 2)]

print(vals)
# _| A
# 0| Simon
# 1| Simon
# 2| Mark
# 3| Mark

You can also select multiple columns in the same way:

vals = df[("C", "B", "A"), (4, 2, 1)]

print(vals)
# _| C     B  A
# 0| True  29 Paul
# 1| True  25 Mark
# 2| False 36 Bob

Notice how the order of both columns and rows differs from that of the original DataFrame. Be careful, however, not to select the same column more than once when using a DataFrame with labeled columns, since column names have to be unique.

Selecting columns as a tuple of int or string is equivalent to calling the get_columns() method.

Rows

You can access a single row by specifying the row index as a single int. For example, the following code returns the row at index 4:

row = df[:, 4]

print(row)
# ['Paul', 29, True, 'a']

The row is returned as a Python list object and the above operation is therefore equivalent to calling the get_row() method on a DataFrame. However, when using the []-operator you can also select specific values within a row by explicitly selecting columns:

row = df[("A", "C"), 4]

print(row)
# ['Paul', True]

In summary, when only one row is specified (as a single int), a single row is returned as a list; when multiple rows are selected (as a tuple of int), a DataFrame containing all selected rows is returned.
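
The rule can be modelled in plain Python. In the sketch below, a dict of lists stands in for the example DataFrame and a list of row lists stands in for the returned DataFrame; `select_rows` is a hypothetical function written for this illustration and is not part of the Pydf API.

```python
# Dict of lists standing in for the example DataFrame (not a Pydf object).
table = {
    "A": ["Bill", "Bob", "Mark", "Sofia", "Paul", "Simon"],
    "B": [34, 36, 25, 31, 29, 21],
    "C": [True, False, True, True, True, False],
    "D": ["a", "b", "c", "b", "a", "b"],
}


def select_rows(table, columns, rows):
    """Hypothetical model of the return-type rule described above."""
    if isinstance(rows, int):
        # A single int selects one row, returned as a flat list.
        return [table[name][rows] for name in columns]
    # A tuple of ints selects several rows (a DataFrame in Pydf,
    # modelled here as a list of row lists).
    return [[table[name][r] for name in columns] for r in rows]


single = select_rows(table, ("A", "C"), 4)
several = select_rows(table, ("C", "B", "A"), (4, 2, 1))

print(single)   # ['Paul', True]
print(several)  # [[True, 29, 'Paul'], [True, 25, 'Mark'], [False, 36, 'Bob']]
```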

Filtering

You can also use the []-operator to quickly filter a DataFrame. This is equivalent to calling the filter() method. The following example illustrates this:

print(df)
# _| A     B  C     D
# 0| Bill  34 True  a
# 1| Bob   36 False b
# 2| Mark  25 True  c
# 3| Sofia 31 True  b
# 4| Paul  29 True  a
# 5| Simon 21 False b

vals = df["D", "a|c"]

print(vals)
# _| A    B  C    D
# 0| Bill 34 True a
# 1| Mark 25 True c
# 2| Paul 29 True a

The above code is equivalent to calling df.filter("D", "a|c"). The search term is a regular expression and has to be specified as a string.
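
To see which rows a search term such as "a|c" selects, the matching step can be reproduced with the standard `re` module. This is a sketch on a plain list standing in for the 'D' column; it assumes that the entire value has to match the expression (`fullmatch`), which may differ from what Pydf does internally.

```python
import re

# Plain Python list standing in for the 'D' column of the example DataFrame.
column_d = ["a", "b", "c", "b", "a", "b"]

pattern = re.compile("a|c")

# Assumption: a row is kept only if its whole value matches the expression.
matching_rows = [i for i, value in enumerate(column_d) if pattern.fullmatch(value)]

print(matching_rows)  # [0, 2, 4]
```

The printed positions refer to the rows of the original DataFrame: Bill, Mark and Paul.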

Setting Values

The entities selected by the []-operator can be set to a new value. For example:

print(df)
# _| A     B  C     D
# 0| Bill  34 True  a
# 1| Bob   36 False b
# 2| Mark  25 True  c
# 3| Sofia 31 True  b
# 4| Paul  29 True  a
# 5| Simon 21 False b

df["B", 3] = 42

print(df)
# _| A     B  C     D
# 0| Bill  34 True  a
# 1| Bob   36 False b
# 2| Mark  25 True  c
# 3| Sofia 42 True  b
# 4| Paul  29 True  a
# 5| Simon 21 False b

The above code sets the value in the 'B' column in the row at index 3 to the value 42. This operation is equivalent to calling the appropriate set_*() method.

An entire row can be set at once by supplying a Python list. For example:

df[:, 3] = ["Hans", 40, False, "c"]

print(df)
# _| A     B  C     D
# 0| Bill  34 True  a
# 1| Bob   36 False b
# 2| Mark  25 True  c
# 3| Hans  40 False c
# 4| Paul  29 True  a
# 5| Simon 21 False b

You can also provide values for only some columns, but then you have to explicitly specify which columns to set. For example:

print(df)
# _| A     B  C     D
# 0| Bill  34 True  a
# 1| Bob   36 False b
# 2| Mark  25 True  c
# 3| Sofia 31 True  b
# 4| Paul  29 True  a
# 5| Simon 21 False b

df[("A", "C"), 3] = ["Hans", False]

print(df)
# _| A     B  C     D
# 0| Bill  34 True  a
# 1| Bob   36 False b
# 2| Mark  25 True  c
# 3| Hans  31 False b
# 4| Paul  29 True  a
# 5| Simon 21 False b

Replacement Function

You can replace all values that match a specified regular expression, either with a constant value or dynamically with the return value of a function. This operation is equivalent to calling the replace() method.

Let's first look at a basic example using a constant value:

print(df)
# _| A     B  C     D
# 0| Bill  34 True  a
# 1| Bob   36 False b
# 2| Mark  25 True  c
# 3| Sofia 31 True  b
# 4| Paul  29 True  a
# 5| Simon 21 False b

df["D", "b|c"] = "z"

print(df)
# _| A     B  C     D
# 0| Bill  34 True  a
# 1| Bob   36 False z
# 2| Mark  25 True  z
# 3| Sofia 31 True  z
# 4| Paul  29 True  a
# 5| Simon 21 False z

As you can see, all matched entries inside the 'D' column have been replaced by the specified value. But just like with the standard replace() method, you can also specify the replacement value as a lambda or normal function:

print(df)
# _| A     B  C     D
# 0| Bill  34 True  a
# 1| Bob   36 False b
# 2| Mark  25 True  c
# 3| Sofia 31 True  b
# 4| Paul  29 True  a
# 5| Simon 21 False b

df["D", "b|c"] = lambda i, v: "z" if df["C", i] else "k"

print(df)
# _| A     B  C     D
# 0| Bill  34 True  a
# 1| Bob   36 False k
# 2| Mark  25 True  z
# 3| Sofia 31 True  z
# 4| Paul  29 True  a
# 5| Simon 21 False k

The above code changes the 'b' and 'c' values inside the 'D' column to a 'z' if the corresponding boolean value in the 'C' column (at the same row index) is True, otherwise it changes it to a 'k' character.
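
Both replacement variants can be modelled in plain Python. The sketch below works on plain lists standing in for the 'C' and 'D' columns; `replace_matching` is a hypothetical function written for this illustration, and the use of `fullmatch` is an assumption rather than a statement about Pydf's internals. As in the example above, a callable receives the row index and the current value.

```python
import re

# Plain Python lists standing in for the 'C' and 'D' columns.
column_c = [True, False, True, True, True, False]
column_d = ["a", "b", "c", "b", "a", "b"]


def replace_matching(values, regex, replacement):
    """Hypothetical model: return a copy of values with all matches replaced."""
    pattern = re.compile(regex)
    result = []
    for index, value in enumerate(values):
        if pattern.fullmatch(value):  # assumption: the whole value must match
            if callable(replacement):
                value = replacement(index, value)  # dynamic replacement
            else:
                value = replacement                # constant replacement
        result.append(value)
    return result


constant = replace_matching(column_d, "b|c", "z")
dynamic = replace_matching(column_d, "b|c", lambda i, v: "z" if column_c[i] else "k")

print(constant)  # ['a', 'z', 'z', 'z', 'a', 'z']
print(dynamic)   # ['a', 'k', 'z', 'z', 'a', 'k']
```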

Note that, unlike the replace() method, the []-operator does not support a DataFrame as the replacement argument.

Slicing

In Python you can use the built-in slicing mechanism to select elements from a list. DataFrames also allow you to apply the same syntax for both column and row selection. For example:

print(df)
# _| A     B  C     D
# 0| Bill  34 True  a
# 1| Bob   36 False b
# 2| Mark  25 True  c
# 3| Sofia 31 True  b
# 4| Paul  29 True  a
# 5| Simon 21 False b

vals = df[1:3, 2:5]

print(vals)
# _| B  C
# 0| 25 True
# 1| 31 True
# 2| 29 True

Please note that when slicing, all columns and rows have to be specified by index.
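
For comparison, the same selection can be written with built-in slicing on plain Python lists. This sketch only models the half-open semantics of the two slices (the end index is excluded); the list of rows is a stand-in for the example DataFrame and is not a Pydf object.

```python
# Column names and rows standing in for the example DataFrame.
names = ["A", "B", "C", "D"]
rows = [
    ["Bill", 34, True, "a"],
    ["Bob", 36, False, "b"],
    ["Mark", 25, True, "c"],
    ["Sofia", 31, True, "b"],
    ["Paul", 29, True, "a"],
    ["Simon", 21, False, "b"],
]

col_slice = slice(1, 3)  # columns at index 1 and 2
row_slice = slice(2, 5)  # rows at index 2, 3 and 4

selected_names = names[col_slice]
selected = [row[col_slice] for row in rows[row_slice]]

print(selected_names)  # ['B', 'C']
print(selected)        # [[25, True], [31, True], [29, True]]
```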