User Reference
Scrunch is a Pythonic scripting library for cleaning data in Crunch. The following document describes the basic usage of scrunch.
Let’s start with a simple session and getting a Crunch dataset.
Use the connect method to authenticate with the Crunch server.
from scrunch import connect, get_dataset
connection = connect(api_key="<your api key>", site_url="https://your-domain.crunch.io/api/")
Or, if you don't have an API Key:
connection = connect(user="[email protected]", pw="yourpassword", site_url="https://your-domain.crunch.io/api/")
The library will automatically detect the provided credentials and will create a persistent session from this point on.
If an explicit call to connect() hasn't been made, the first call to get_dataset() or get_project() will establish a default connection to the default Crunch host.
The default connection will first look at the environment variable CRUNCH_API_KEY or a crunch.ini file in the local path with the following format:
[DEFAULT]
CRUNCH_API_KEY=apikeysecret
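With that in place you can skip the explicit connect() call entirely; a minimal sketch of relying on the default connection:
from scrunch import get_dataset

# No explicit connect(): scrunch picks up CRUNCH_API_KEY (or crunch.ini)
# and establishes a default connection to the default Crunch host.
ds = get_dataset("Dataset X")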
When we get a dataset the existing connection is used (see above).
Getting a dataset is as simple as:
ds = get_dataset("Dataset X")
Sometimes datasets will belong to a project. In this case we need to pass the project parameter with the name or the id of the project to the get_dataset function.
ds = get_dataset("Dataset X", project="Project Y")
When we get a project the existing connection is used (see above).
Getting a project is as simple as:
pr = get_project("Project X")
You can create a nested project by using create_project:
pr.create_project("Project Y")
You can rename an existing project using rename:
pr.rename("Project Z")
You can delete an empty project by using delete:
pr.delete()
If you know the name or id of a nested project you can get it in the following way:
pr_b = pr_a.get('Project B')
You can iterate over the contents of a project by using children. Items here will be either datasets or projects (which can themselves be iterated over in turn):
for child in pr.children:
    if isinstance(child, Dataset):
        print("I am dataset: %s" % child.name)
    elif isinstance(child, Project):
        print("I am a project: %s" % child.name)
To add datasets or projects to a project use move_here. By default the given items will be appended at the end. Optionally, a specific position, or a relative before or after position, can be given:
pr.move_here(items=[ds])
pr.move_here(items=[ds1, ds2], position=0)
pr.move_here(items=[ds1, pr1], before='Dataset X')
pr.move_here(items=[ds1, ds2], after='Dataset Y')
Note that items must always be given as a list even if there is only one item.
To move a dataset into your Personal project, follow a similar approach to what is described above except use get_personal_project to initially get the project. With this approach the dataset will be removed from its current project and anyone with whom the dataset has been shared directly will be able to find it in their Personal Project:
pr = scrunch.get_personal_project()
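Then move the dataset into it using move_here, exactly as described above:
pr.move_here(items=[ds])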
To see what members have access to the given project you need to access the Members instance on each project using:
members = pr.members
To see the email addresses of all members attached to the project use list:
>>> pr.members.list
[u'[email protected]',
u'[email protected]']
Add or remove members of the project in the following way:
pr.members.add('[email protected]')
pr.members.remove('[email protected]')
To change the permissions of a member:
pr.members.edit('[email protected]', {'team_admin': True})
pr.members.edit('[email protected]', {'edit': True})
Teams are a convenient way to manage groups of users as a single unit.
You can share datasets with teams as you can with a single user. You can also add a team as a member of a project.
Create a new team (you will be the only team member and you will be set as the team's admin):
hogwarts = scrunch.create_team("Hogwarts")
You can get an existing team in the following way:
hogwarts = scrunch.get_team("Hogwarts")
You can access and edit the members of a team in the following way. If you add a member using edit=True that team member will also become a team admin.
>>> members = hogwarts.members
>>> members.list()
[u'[email protected]']
>>> hogwarts.members.add('[email protected]', edit=True)
>>> hogwarts.members.add('[email protected]')
>>> hogwarts.members.add('[email protected]')
>>> hogwarts.members.add('[email protected]')
>>> hogwarts.members.list()
[u'[email protected]',
u'[email protected]',
u'[email protected]',
u'[email protected]',
u'[email protected]']
>>> hogwarts.members.remove('[email protected]')
>>> hogwarts.members.list()
[u'[email protected]',
u'[email protected]',
u'[email protected]',
u'[email protected]']
>>> hogwarts.members.edit('[email protected]', permissions={'team_admin': True})
One crucial step when manipulating a dataset is that you need to be the dataset's current editor. There are two ways to achieve this.
Either take editorship when you get the dataset:
ds = get_dataset("Example Dataset", editor=True)
Or set yourself as the editor by calling:
ds.change_editor(user='[email protected]')
Becoming the editor will allow you to edit the dataset name and description:
ds.edit(name='Brand Tracker', description='Waves 1-50', notes='Results current as of 24 Nov 2016')
You can also add new users (with editor or viewer privileges), edit the privileges of an existing user, or remove an existing user's access to the dataset completely:
# add a new user with editor privileges
ds.add_user(email, edit=True)
# add multiple new users with a list
ds.add_user([email_1, email_2], edit=True)
# change an existing user's privileges to viewer
ds.edit_user(email, edit=False)
# remove an existing user
ds.remove_user(email)
You can also check who the current editor is:
>>> ds.editor
"<User: email='[email protected]'; id='892397ace5154b41b99b2dc04bd3ffc4'>"
Information on current owner is available in a similar way:
>>> ds.owner
"<User: email='[email protected]'; id='f0df1fcb2d1e406d8361e6403d7b645d'>"
The owner of a dataset can be either a user or a project. In both cases the change_owner method is what you need to pass ownership between entities:
ds.change_owner(user='[email protected]')
ds.change_owner(project='Coca-Cola')
The settings of the dataset can be accessed via the settings property:
>>> ds.settings
{
    'viewers_can_export': False,
    'min_base_size': 0,
    'weight': None,
    'viewers_can_change_weight': False
}
The dataset.settings property is a read-only dictionary. Modification of settings is done through the Dataset.change_settings method, which currently takes the viewers_can_export and viewers_can_change_weight parameters in order to change the values of those dataset settings:
ds.change_settings(viewers_can_export=True)
ds.change_settings(viewers_can_export=False, viewers_can_change_weight=True)
The rest of the settings are currently read-only. Trying to change invalid or read-only settings raises a ValueError exception.
The list of known weight variables can be accessed with the following code:
>>> ds.weights
[u'weight', u'Weight-age-nets']
Savepoints work as snapshots of the current state of the dataset. You can return to a savepoint later to undo whatever has been done to the dataset in the meantime.
You can create savepoints with the following:
ds.create_savepoint('savepoint description')
To revert changes and return to the previous savepoint do the following:
ds.load_savepoint('savepoint description')
When loading a savepoint all savepoints after the one that is being loaded will be lost permanently.
While editing the dataset you may want to work on a fork, which is a copy of the dataset that can be merged back onto the original later. The parent dataset keeps a record of any forks that may have been created from it.
Starting with the instance of a dataset, use the fork() method to create a fork.
ds_fork = ds.fork()
The resulting object is also a scrunch.datasets.Dataset instance and you can use it as you would any other dataset instance. The fork can be given a name and a description if desired.
ds_fork = ds.fork(name="Fork #1", description="Working version")
If name is not given then one will be generated automatically from ds.name in the form "FORK #{} of {}".format(nforks + 1, ds.name), where nforks is the current number of forks attached to ds. It is usually better to allow the fork name to be generated in this way. Provide a less technical description if needed.
The owner of the fork is the user that created it unless the preserve_owner parameter is set to True, in which case the fork will be owned by the same User as its parent dataset:
ds_fork = ds.fork(preserve_owner=True)
If the owner of the parent dataset is a Crunch Project (instead of a regular Crunch User), then the ownership will be preserved and the fork will be part of the same Project as its parent dataset, regardless of the preserve_owner parameter.
Just as you can delete a dataset, a fork can be deleted by calling the .delete() method:
ds_fork.delete()
It is also possible to delete all forks at once using the .delete_forks() method on the dataset. Note that here we are using the ds object, not the ds_fork.
ds.delete_forks()
It's also possible to merge a fork back into its parent dataset using the merge method. This can be done by providing either the fork id or its name, or, where the fork name has been auto-generated (as described above), the fork number given as either an int or a str.
ds.merge(ds_fork.id)
ds.merge("Fork #1")
ds.merge(1)
ds.merge("1")
Sometimes when a fork is merged back into its parent dataset conflicts between the two are exposed. autorollback provides some control in this situation. This is an optional argument that is True by default.
ds.merge(1, autorollback=False)
If True the parent dataset will be rolled back to the previous savepoint in case of a merge conflict. If False the parent dataset and fork will be left as-is.
Logical expressions can be used in any place where a description or query against the dataset is needed. They can be composed using the following logical operators:
operator | meaning |
---|---|
== | equal |
!= | unequal |
> | greater than |
>= | greater or equal |
< | less than |
<= | less or equal |
and | logical and |
or | logical or |
in | in list/tuple |
any | any(list/tuple) |
all | all(list/tuple) |
not | not in/any/all |
r | r(lower, upper) |
The r() helper function declares a range of integers but it must be wrapped in brackets []. The expression [r(1,3), 6, 7] automatically translates to [1, 2, 3, 6, 7].
For example:
"age in [r(16,24)]"
"q1.any([r(1,1000)]"
Compound expressions are also possible, such as:
"disposition != 0 or exit_status != 1"
"(age in [r(35, 54)] and gender == 1) or region == 'London'"
Date expressions are supported as long as they are ISO8601 compliant. That means that you may also provide only a partial representation of a complete ISO8601 datetime string as in the following examples:
ds.exclude('starttime < "2015-01-01T00:00:00+00:00"') # complete iso8601 string
ds.exclude('starttime < "2015-01-01T12:30"') # day, hour, minute
ds.exclude('starttime < "2015-01-01T12"') # day, hour
ds.exclude('starttime < "2015-01-01")') # day only
It's possible to permanently delete rows in the dataset using drop_rows. This cannot be undone except by reverting to a previous savepoint.
To permanently delete rows use the following:
ds.drop_rows("disposition != 1")
Exclusion filters are used to hide invalid data. In the context of a survey this could be respondents who spent too little time answering it ("speeders"), rows with inconsistent data, etc. Exclusion filters are specified using a logical expression.
For example, assume that we have a variable with the alias "disposition" in the dataset. Then apply the exclusion filter:
ds.exclude("disposition != 'complete'")
In the above example disposition is a categorical variable and "complete" is the name of one of its categories. An equivalent expression could reference the id (or code) assigned to that category instead:
ds.exclude("disposition != 1")
It's possible to filter using two variables. This will compare the categorical names in the two variables:
ds.exclude("income == household_income")
We can also exclude a list referencing either id or name and using either brackets [] or parentheses ():
ds.exclude("disposition in [0, 2]")
ds.exclude("disposition in ('incomplete', 'screenout')")
If you want to exclude all missing or valid values for a given variable you can achieve that with the following examples:
ds.exclude("missing(income)")
ds.exclude("valid(income)")
You can also check the exclusion filter on a dataset at any time with the following:
>>> ds.exclude('disposition != 1')
>>> ds.get_exclusion()
"disposition != 1"
This will print the exclusion filter, but for a complicated filter the parentheses may differ between what was provided and what get_exclusion returns. Rest assured the meaning of the exclusion filter is not affected by this difference.
This method can also be used to extend your filters in a convenient way:
>>> ds.exclude('disposition != 1')
>>> ds.exclude(ds.get_exclusion() + ' and exit_status >= 1')
>>> ds.get_exclusion()
"disposition != 1 and exit_status >= 1"
Datasets allow viewers and editors to add custom filters which can be reused on many features of the dataset. A filter is defined by a logical expression and is attached to the dataset.
In order to add a filter to a dataset:
ds.add_filter(expr='age > 21 and race == 3', name='myfilter')
The filter can be made public, meaning that all users of the dataset will be able to see and apply it if desired.
ds.add_filter(expr='age > 21 and race == 3', name='myfilter', public=True)
Navigating a dataset's existing filters is done via the filters property.
ds.filters['myfilter']
In this way they are also editable:
ds.filters['myfilter'].edit(
    name='18+',
    expression='age > 18 and race == 2',
    public=False
)
It's possible to append cases from one dataset onto another. It's also possible to join variables from one dataset onto another by matching a unique key in each. However, both of these operations rely on the datasets involved sharing the same metadata in order to avoid conflicts. To discover such conflicts ahead of time it's useful to compare them first.
Let's use the following datasets in our examples:
ds1 = get_dataset('Dataset X')
ds2 = get_dataset('Dataset Y')
Compare the metadata between the two datasets first with the following:
ds1.compare_dataset(ds2)
This will return details of any discovered conflicts that may cause problems in an append_dataset or join operation. If any such conflicts are found you will need to use what is returned by compare_dataset to manually resolve them before continuing on to join or append the datasets.
Use the append_dataset method to append rows from one dataset onto another.
ds1.append_dataset(
    dataset,
    filter="Wave in [3, 4, 5]",
    variables=None,
    autorollback=True
)
dataset is the dataset instance to append from.
filter is a logical expression for filtering rows to be appended from the incoming dataset.
variables is a list of variable names to append data onto from the incoming dataset.
autorollback: if True and the operation fails, the dataset will be rolled back to its pre-append state automatically.
These last 3 arguments are optional.
Use the join method to join variables from one dataset onto another by matching on a unique key.
ds1.join(
    left_var='uid',
    right_ds=ds2,
    right_var='uid',
    columns=['languagepreference'],
    filter="languagepreference == 3",
    wait=True
)
columns is a list of variables we want to join onto the dataset.
filter is a logical expression for filtering rows in the join operation.
wait indicates whether the method should wait for Crunch to finish the join or instead return a URL with the progress indicator.
These last 3 arguments are optional.
Variables are available as members of the dataset object, like a dictionary. Like datasets they expose some attributes that can be changed and others that can only be read.
Editable attributes are: name, description, view, notes, format and uniform_basis.
The immutable attributes are: id, type, categories, discarded (see below) and url.
Special attributes: alias is editable but only for non-streaming datasets.
uniform_basis is a special attribute for multiple_response variables. If uniform_basis=True (the default) the percentage for each subvariable's selected categories will be calculated against the cumulative base for all of the subvariables. If uniform_basis=False the percentages for each subvariable will instead be calculated independently (a so-called floating base).
To edit them the Variable class provides the edit method:
ds['var'].edit(name='a short name', description='a different description')
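Since uniform_basis is listed among the editable attributes, toggling it on a multiple_response variable might look like this (a sketch; 'my_array' is a hypothetical alias):
ds['my_array'].edit(uniform_basis=False)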
The discarded attribute is actually edited using the dedicated helper methods hide and unhide:
ds['var'].hide()
ds['var'].unhide()
categorical variables can be decorated with headings and subtotals, which are special elements that can appear on top of, within or below the variable's categories.
A heading is simply a piece of text. Add a heading with the following.
var = var.add_heading(name='Coke products', categories=1, anchor="top")
var = var.add_heading(name='Sprite products', categories=5, anchor="top")
As you can see in the above example, add_heading() returns the edited version of the variable. This is very important because if you need to continue working with the variable, including by adding additional headings or subtotals, you must at each step use the most recently returned version of the variable. Otherwise, you will only end up with the last heading or subtotal you added.
name is the text that will be inserted at the indicated position.
categories is where you indicate which category is the reference point for the anchor.
anchor is where you indicate if the heading should appear at the "top" or "bottom" of the named category.
A subtotal is the auto-aggregation of a number of categories. Add a subtotal with the following.
var = var.add_subtotal(name='Coke products', categories=[1,2,3,4], anchor="top")
var = var.add_subtotal(name='Sprite products', categories=[5,6,7], anchor="top")
As you can see in the above example, add_subtotal() returns the edited version of the variable. This is very important because if you need to continue working with the variable, including by adding additional headings or subtotals, you must at each step use the most recently returned version of the variable. Otherwise, you will only end up with the last heading or subtotal you added.
name is the text that will be inserted at the indicated position.
categories is where you indicate which categories should be aggregated together into a new category-like element on the variable.
anchor is where you indicate if the subtotal should appear at the "top" or "bottom" of the named categories.
You can see what transformations exist on a variable (that has been returned by either add_heading or add_subtotal!):
>>> var.transformations()
[
    {
        u'function': u'subtotal',
        u'args': [1, 2, 3, 4],
        u'anchor': 'Top',
        u'name': u'Coke products'
    },
    {
        u'function': u'subtotal',
        u'args': [5, 6, 7],
        u'anchor': 'Top',
        u'name': u'Sprite products'
    }
]
To clear all heading/subtotal transformations on a variable use either of the adding methods and pass categories=None, for example:
>>> var.add_heading(name="", categories=None)
>>> var.transformations()
[]
This dataset method adds a new material (non-derived) variable by uploading a list of values that will be used to populate it immediately.
For example:
new_var = ds.create_variable(
    var_type='categorical',
    alias='football_clubs',
    name='Football clubs',
    description='Which football club do you follow?',
    categories=[
        {'name': 'ManU', 'id': 1, 'numeric_value': 1, 'missing': False},
        {'name': 'Chelsea', 'id': 2, 'numeric_value': 2, 'missing': False},
        {'name': 'Tottenham', 'id': 3, 'numeric_value': 3, 'missing': False}
    ],
    values=[2, 1, 3, 1, 2, 2, 3]
)
Supported var_types: text, numeric, categorical, datetime, multiple_response or categorical_array.
alias, name, description and notes will be inherited by the new variable directly.
When var_type is categorical or categorical_array, categories must be given. categories is a list in the form given in the above example.
When var_type is multiple_response the categories given must indicate a boolean value for selected as per the below example. This is what differentiates a multiple_response from a categorical_array.
When var_type is multiple_response or categorical_array, subvariables must be given. subvariables is a list in the form:
new_var = ds.create_variable(
var_type='multiple_response',
alias='football_clubs',
name='Football clubs',
description='Which football clubs are you aware of?',
categories=[
{'name': 'Not selected', 'id': 2, 'missing': False, "selected": False},
{'name': 'Selected', 'id': 1, 'missing': False, "selected": True}
],
subvariables=[
{'name': 'ManU', 'alias': 'ManU', 'id': 1},
{'name': 'Chelsea', 'alias': 'Chelsea', 'id': 2}
],
values=zip(values_ManU, values_Chelsea)
)
Note that the shape of the values given to a new multiple_response or categorical_array is a list of lists. More specifically, it's a list of row values rather than a list of column values. Use the zip function, as demonstrated in the above example, to reshape column-values to row-values if needed.
It's possible to move an existing group of categorical variables into a new categorical_array using move_to_categorical_array. This will result in the creation of a new non-derived/material categorical_array. If you need to create a new derived categorical_array you will need to use bind_categorical_array instead.
ds.move_to_categorical_array(
    alias="my_array",
    name="My array",
    description="This is my array",
    notes="",
    subvariables=[cat1_alias, cat2_alias, cat3_alias]
)
This method is the equivalent of move_to_categorical_array except that it uses an existing group of categorical variables to create a new non-derived/material multiple_response.
ds.move_to_multiple_response(
    alias="my_multi",
    name="My multi",
    description="This is my multi",
    notes="",
    subvariables=[cat1_alias, cat2_alias, cat3_alias]
)
Add an existing categorical variable to an existing categorical_array or multiple_response using move_as_subvariable:
ds.move_as_subvariable(
    destination=alias_array,
    source=alias_categorical
)
It's possible to reorder the subvariables in an array using reorder_subvariables. This can be used to reorder the subvariables in a categorical_array or a multiple_response and works just as well on a derived variable as it does on a non-derived/material variable:
ds.reorder_subvariables(
    subvariables=[subalias2, subalias1, subalias3]
)
It's possible to "unbind" all (and only all) of the subvariables in a non-derived/material categorical_array
using unbind
.
my_array = ds[alias]
my_array.unbind()
This will result in all of the subvariables in the targeted array becoming independent categorical variables on the dataset.
The values for specific variables can be changed explicitly (i.e. without using a filter expression or derivation to generate values independently for each row) using replace_values. Using this method a variable is named and a value is given, which will then overwrite any data already in that variable.
Optionally, a filter expression can be provided, in which case the data will only be applied to rows satisfying that expression.
Multiple variables and values can be defined at the same time as long as the edits all conform to the same filter (if any).
In the following example any data in the variable weight, where the variable segment is either 2 or 5, will be overwritten with the value 1.
ds['weight'].replace_values(value=1, filter='segment in [2, 5]')
To remove data from a variable pass None to it, as in the next example where any/all values that may have been in the weight variable are being overwritten with nothing:
ds['weight'].replace_values(value=None)
To replace subvariable values use the same approach directly on the subvariable:
ds[alias][subalias].replace_values(value=None)
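As noted above, several variables can be edited in a single call when they share the same filter (or no filter). A minimal sketch, assuming the dataset itself also exposes a replace_values method that accepts a dict of {alias: value} (the aliases and filter here are illustrative):
ds.replace_values({'weight': 1, 'segment': 5}, filter='disposition == 1')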
Scrunch provides a number of ways to add derived variables to a dataset.
See the NOTE and the safely_integrate_weight example below before using this method.
Add a derived weight to a dataset using derive_weight.
Provide the required targets as per the below example, a list in the form [{alias: targets}, ...]. Each list of targets must sum exactly to 1.
ds.derive_weight(
targets=[
{'gender' : {
1: .45,
2: .55}},
{'locality': {
1: .10,
2: .15,
3: .20,
4: .25,
5: .30}}
],
alias='weight_natrep',
name='NatRep weight'
)
This will add a new derived variable with the alias weight_natrep to the dataset. As a derived variable its result will change with changes in the dataset (i.e. more rows or changed data in dependent variables). If you need to edit this variable in any way that is disallowed for derived variables, or need the result to remain unchanged despite any other changes to the dataset going forward, you must integrate it immediately afterwards.
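For example, to lock in the derived values right away:
ds['weight_natrep'].integrate()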
NOTE: Be very careful using derive_weight with exclusions active on the dataset! If any exclusions are active derive_weight WILL populate weight factors for excluded cases. If those excluded cases are included again later the weighted results WILL BE WRONG. One way to overcome this complication is to use a function that will ensure your weights are consistently handled in the same way:
def safely_integrate_weight(ds, weight_filter=None, *args, **kwargs):
    """
    Safely derive, then integrate a weight on ds by guarding any exclusions.

    This function will:
    1. Record the current exclusion on ds for later
    2. Remove all exclusions on ds
    3. (optional) Apply negated weight filter as exclusion on ds
    4. Derive the weight using the given *args and **kwargs
    5. Integrate the derived weight variable
    6. Replace values in negated weight filter rows for the weight
       variable with No Data
    7. Re-apply the original exclusion recorded earlier

    Parameters
    ----------
    ds : scrunch.datasets.Dataset
        The targeted Scrunch dataset.
    weight_filter : str, default=None
        If given, a filter describing which rows in the dataset the weight
        should be generated for. Other rows will end up with No Data in
        the new weight variable.
    *args : other arguments
        These will be passed to ds.derive_weight().
    **kwargs : other keyword arguments
        These will be passed to ds.derive_weight().

    Returns
    -------
    var : scrunch.Variable
        The integrated weight variable.
    """
    original_exclusion = ds.get_exclusion()
    ds.exclude()
    if weight_filter is not None:
        not_weight_filter = 'not ({})'.format(weight_filter)
        ds.exclude(not_weight_filter)
    var = ds.derive_weight(*args, **kwargs)
    var.integrate()
    if weight_filter is not None:
        var.replace_values(value=None, filter=not_weight_filter)
    ds.exclude(original_exclusion)
    return var
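Usage mirrors derive_weight itself; for example, weighting only completed interviews (the filter and targets here are illustrative):
weight = safely_integrate_weight(
    ds,
    weight_filter='disposition == 1',
    targets=[{'gender': {1: .45, 2: .55}}],
    alias='weight_natrep',
    name='NatRep weight'
)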
This dataset method adds a new derived variable that can be either single-response (categorical) or multiple-response (multiple_response). When used to create a multiple_response this method acts as a convenient wrapper around the more explicit derive_multiple_response, which should be used if more control than this method allows for is required.
alias, name, description and notes will be inherited by the new variable directly.
multiple controls whether or not the new variable should be created as a multiple_response. If False (the default) it will be created as a simple categorical (single-response). If your case statements are not mutually exclusive you should make sure multiple=True in order to capture all of your case statements correctly.
When a single-response (categorical) variable is needed multiple should be omitted or set to false (multiple=False).
categories is a list of dictionaries with the required keys id, name and case.
case is the logical expression describing which respondents should be recoded into each id of the new variable. The name for each category will be shown in the UI.
numeric_value may also be given to a category to set factors for each category if they need to differ from the id (for example in the calculation of a mean).
new_var = ds.create_categorical(
alias='genage5',
name='Gender/Age (5 groups)',
description='Gender by 18-24/25-34/35-44/45-54/55+',
notes='All UK adults',
categories=[
{'id': 1, 'name': 'Men 18-24', 'case': 'gender == 1 and agerange == 1'},
{'id': 2, 'name': 'Men 25-34', 'case': 'gender == 1 and agerange == 2'},
{'id': 3, 'name': 'Men 35-44', 'case': 'gender == 1 and agerange == 3'},
{'id': 4, 'name': 'Men 45-54', 'case': 'gender == 1 and agerange == 4'},
{'id': 5, 'name': 'Men 55+', 'case': 'gender == 1 and agerange == 5'},
{'id': 6, 'name': 'Women 18-24', 'case': 'gender == 2 and agerange == 1'},
{'id': 7, 'name': 'Women 25-34', 'case': 'gender == 2 and agerange == 2'},
{'id': 8, 'name': 'Women 35-44', 'case': 'gender == 2 and agerange == 3'},
{'id': 9, 'name': 'Women 45-54', 'case': 'gender == 2 and agerange == 4'},
{'id': 10, 'name': 'Women 55+', 'case': 'gender == 2 and agerange == 5'}]
)
When a multiple-response (multiple_response) variable is needed multiple must be set to true (multiple=True).
You will also have the option to set uniform_basis to True or False. If uniform_basis=True (the default) the percentage for each subvariable's selected categories will be calculated against the cumulative base for all of the subvariables. If uniform_basis=False the percentages for each subvariable will instead be calculated independently (a so-called floating base).
multiple_response variables are arrays and the categories given will be interpreted as instructions for the array's subvariables. The new multiple_response array will be given the following default set of categories:
id | name | selected | missing |
---|---|---|---|
1 | Selected | True | False |
2 | Not Selected | False | False |
-1 | No Data | False | True |
In this case categories is a list of dictionaries with the required keys id, name and case.
id will be combined with the alias of the new variable to uniquely name the multiple_response array's subvariables. For example when creating a new multiple_response with the alias drinks, a "category" with the 'id': 1 will result in a subvariable with the alias drinks_1.
case indicates the condition required for that "category" (subvariable) to be considered selected.
Optionally, missing_case can be given and indicates the condition for that "category" (subvariable) to be considered missing.
Rows in the data for each "category" (subvariable), respectively, will be considered to be not selected if they satisfy neither case nor missing_case, or if they do not satisfy case and missing_case has not been given.
NOTE: Remember that not selected is still considered to be a valid response and is not missing, even though such cases will not be counted as selected for that "category" (subvariable).
missing_case can be given EITHER once to the method as a whole OR individually for each "category" (subvariable).
If missing_case is given to the method itself it will be applied automatically as the missing_case for each "category" (subvariable), as in the following example:
new_var = ds.create_categorical(
alias='drinks',
name='Preferred drinks',
description='Which drinks do you prefer?',
notes='',
multiple=True,
missing_case='missing(q1)',
uniform_basis=False,
categories=[
{'id': 1, 'name': 'Coke products', 'case': 'q1 in [1,2,3,4,5]'},
{'id': 2, 'name': 'Pepsi products', 'case': 'q1 in [6,7,8,9,10]'},
{'id': 3, 'name': 'No sugar products', 'case': 'q1 in [2,3,7,8]'},
{'id': 4, 'name': 'Diet products', 'case': 'q1 in [2,7]'},
{'id': 5, 'name': 'Diet alternative products', 'case': 'q1 in [3,8]'},
{'id': 6, 'name': 'Other', 'case': 'q1 in [11,12,13,14,15]'}
]
)
Alternatively, missing_case can be given explicitly to all, some or none of the "categories" (subvariables):
new_var = ds.create_categorical(
alias='drinks',
name='Preferred drinks',
description='Which drinks do you prefer?',
notes='',
multiple=True,
categories=[
{'id': 1, 'name': 'Coke products', 'case': 'q1 in [1,2,3,4,5]', 'missing_case': 'missing(q1)'},
{'id': 2, 'name': 'Pepsi products', 'case': 'q1 in [6,7,8,9,10]', 'missing_case': 'missing(q1)'},
{'id': 3, 'name': 'No sugar products', 'case': 'q1 in [2,3,7,8]', 'missing_case': 'missing(q1)'},
{'id': 4, 'name': 'Diet products', 'case': 'q1 in [2,7]', 'missing_case': 'missing(q1)'},
{'id': 5, 'name': 'Diet alternative products', 'case': 'q1 in [3,8]', 'missing_case': 'missing(q1)'},
{'id': 6, 'name': 'Other', 'case': 'q1 in [11,12,13,14,15]', 'missing_case': 'missing(q1)'}
]
)
This method creates a new single categorical derived variable, filling its values from other categorical variables.
It takes a variables argument that indicates the cases and variables to fill from. Similar to the create_categorical responses object, this consists of a list of dictionaries, each with a case key containing a filter expression and a variable key indicating which column to use to fill.
An additional else case is allowed, in order to fill up all other rows that don't satisfy any of the indicated cases. It is possible to fill else-rows with either a variable or a default category.
The following example creates a new variable "Pop drinking frequency" based on the conditions of the pop_pref variable. For all cases where it equals 1 (prefers Coke), use the coke_freq responses, and similarly for Pepsi. Everything else defaults to other_pop_freq.
dataset.create_fill_values([
    {"case": "pop_pref == 1", "variable": "coke_freq"},
    {"case": "pop_pref == 2", "variable": "pepsi_freq"},
    {"case": "else", "variable": "other_pop_freq"}
], alias="pop_freq", name="Pop drinking frequency")
Optionally, it is possible to declare a missing category, in this case with code 99, that will map to "Not pop drinker".
dataset.create_fill_values([
    {"case": "pop_pref == 1", "variable": "coke_freq"},
    {"case": "pop_pref == 2", "variable": "pepsi_freq"},
    {"case": "else", "name": "Not pop drinker", "id": 99, "missing": True}
], alias="pop_freq", name="Pop drinking frequency")
This method adds a new derived multiple_response to the dataset with full control over the selected/not selected/missing categories for each subvariable.
In the following example the categories of the new multiple_response are given explicitly. This allows full control over the selected and missing attributes of each. For each category 'selected': False and 'missing': False will be assumed and should be overridden explicitly (as in the following, where 1 is selected and -1 is missing).
The variable's subvariables must also be defined, including for each the case expressions for each category. These subvariables are what will be displayed as if they were categories in the UI (in this case "Coke", "Pepsi", "Other", "None" and "Coke or Pepsi").
alias, name, description and notes will be inherited by the new variable directly.
new_var = ds.derive_multiple_response(
name='Preferred drinks',
alias='drinks',
description='Which drinks do you prefer?',
notes='',
categories=[
{'id': 1, 'name': 'Selected', 'selected': True},
{'id': 2, 'name': 'Not Selected'},
{'id': -1, 'name': 'No Data', 'missing': True}
],
subvariables=[
{
'id': 1,
'name': 'Coke',
'cases': {
1: 'q1 in [1]',
2: 'not q1 in [1] and valid(q1)',
-1: 'missing(q1)',
}
},
{
'id': 2,
'name': 'Pepsi',
'cases': {
1: 'q1 in [2]',
2: 'not q1 in [2] and valid(q1)',
-1: 'missing(q1)',
}
},
{
'id': 3,
'name': 'Other',
'cases': {
1: 'q1 in [95]',
2: 'not q1 in [95] and valid(q1)',
-1: 'missing(q1)',
}
},
{
'id': 4,
'name': 'None',
'cases': {
1: 'q1 in [99]',
2: 'not q1 in [99] and valid(q1)',
-1: 'missing(q1)',
}
},
{
'id': 5,
'name': 'Coke or Pepsi',
'cases': {
1: 'q1 in [1, 2]',
2: 'not q1 in [1, 2] and valid(q1)',
-1: 'missing(q1)',
}
}
]
)
This method combines a number of categories on a given variable into a new derived variable (single or multiple response). A map argument here takes the place of the case statement from the create_categorical example because the conditions for the new variable may only rely on the parent variable from which it is being created.
For example:
comb = ds.combine_categories(
ds['genage5'],
alias='Age5way',
name='5-way age',
description='18-24/25-34/35-44/45-54/55+',
categories={
1: '18-24',
2: '25-34',
3: '35-44',
4: '45-54',
5: '55+',
9: 'No answer'
},
map={
1: [1, 6],
2: [2, 7],
3: [3, 8],
4: [4, 9],
5: [5, 10],
},
default=9,
missing=[-1, 9]
)
alias, name and description will be inherited by the new variable directly.
categories is a dictionary of items in the form {id: name} that will be used to generate the new variable's metadata.
map is a dictionary equivalent to a shorthand case statement, where the new variable's id is keyed to one or more of the parent variable's ids.
default is where you can give the value that should be used if none of the items in the map apply to a given row of data.
missing indicates which categories, if any, should be set to missing for aggregation purposes.
A new variable created using combine_categories inherits the type of its parent, either categorical or multiple_response. This is because the same categories from the parent cannot be referenced more than once in the map, which means that defining overlapping combinations of responses in your map is not allowed.
It's possible to "bind" a group of categorical
variables into a new categorical_array
using bind_categorical_array
. This will result in the creation of a new derived categorical_array
. You should integrate
the result if you later need to make non-derived edits to it, or alternatively use the method move_to_categorical_array
for creating a new non-derived/material categorical_array
.
ds.bind_categorical_array(
    alias="my_array",
    name="My array",
    description="This is my array",
    notes="",
    subvariables=[
        {"id": 1, "alias": cat1_alias},
        {"id": 2, "alias": cat2_alias},
        {"id": 3, "alias": cat3_alias}
    ]
)
A new numeric variable can be created using a derivation expression.
ds.create_numeric(
    alias='monthly_rent',
    name='Monthly rent',
    description='Rent paid per month',
    notes='All UK adults',
    derivation='(weekly_rent * 52) / 12'
)
It's possible to set the missing rules for a numeric, text or date variable by using its missing_rules attribute, which behaves like a dict.
Additionally, the variable has a set_missing_rules method which can be used to set multiple rules at once (or none) by passing a dict. Note that using set_missing_rules will fully replace any rules that already sit on the variable.
Remember that missing rules for categorical variables are handled directly by their categories.
>>> var.missing_rules
{}
>>> var.missing_rules['invalid'] = -1
... var.missing_rules
{u'invalid': -1}
>>> var.missing_rules['NA'] = -2
... var.missing_rules
{u'invalid': -1, u'NA': -2}
>>> var.missing_rules['NA'] = -9
... var.missing_rules
{u'invalid': -1, u'NA': -9}
>>> del var.missing_rules['NA']
... var.missing_rules
{u'invalid': -1}
>>> rules = {'none': -8, 'unknown': -9}
... var.set_missing_rules(rules)
... var.missing_rules
{u'none': -8, u'unknown': -9}
>>> var.missing_rules.clear()
>>> var.missing_rules
{}
This method adds a copy of an existing variable to the dataset. In the following example a copy of genage5 is created:
new_var = ds.copy_variable(
    ds['genage5'],
    alias='genage5_copy',
    name='Copy of Gender/Age (5 groups)',
    derived=True
)
A copied variable cannot be integrated after it exists (using var.integrate()). For that reason you need to decide at the time of its creation if it should be derived or not. However you can also choose to accept the derived status of the variable you're copying, whatever that is, by using derived=None. In that case your copy will be derived only if the source variable was derived.
It's possible to convert a derived variable to a non-derived variable. This is known as "integrating" the derived variable and it cannot be undone.
var.integrate()
It's possible to cast a variable from one type to another. It may be useful to preview the result of doing this beforehand, which is what cast_summary is for. For example, to see the summary of a numeric version of a given categorical variable (without actually creating the numeric version):
summary = ds.cast_summary(var_categorical.alias, 'numeric')
If you're happy with the result then you can add a new version of the source variable to the dataset using cast:
var_numeric = ds.cast(var_categorical.alias, 'numeric')
After doing this both the original typed variable and the newly cast version of it will exist. If you no longer need the original you should hide it.
NOTE: If you want to cast a categorical to a numeric you must provide numeric_values for the categories that you want to be cast. Any categories without a numeric_value will be missing in your new numeric.
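For example, a hedged sketch of setting numeric_value on the categories you care about before casting (the alias and ids here are hypothetical):
ds['satisfaction'].categories[1].edit(numeric_value=100)
ds['satisfaction'].categories[2].edit(numeric_value=50)
var_numeric = ds.cast('satisfaction', 'numeric')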
datetime variables have a unique attribute known as "resolution". This attribute may describe one of two things:
- The time span units described explicitly by the case data (datetime is typically stored in milliseconds)
- The rollup time span units used to render the datetime case data in the app (e.g. dates are typically rendered in days, weeks, quarters or years)
The Crunch app displays datetime data by rendering the case data as if it were stored with the variable's rollup_resolution. The default rollup_resolution of a datetime variable is the case data's actual resolution.
Analyzing datetime data in its typically millisecond resolution is often not useful, so the edit_resolution method can be used to apply a rollup_resolution using one of the following codes:
'Y': Year
'Q': Quarter
'M': Month
'W': Week
'D': Day
'h': Hour
'm': Minute
'ms': Millisecond
For example, to render var in the app in weeks:
var.edit_resolution('W')
Dataset variables behave as a Python dictionary, with the Dataset being the structure and variables the elements or keys of the dictionary. Since dictionaries are referenced by key, in this case the key is a variable label and the value is an instance of the class Variable.
- ds.keys() → returns a list of variable names in the dataset
- ds.values() → returns a list of Variables in the dataset
- ds.items() → returns a list of tuples of type (key, value)
Given this interface we can access a Dataset variable by using the Python dictionary syntax:
var = ds['my_var']
Or iterate over a datasets’s variables:
for name, var in ds.items():
print(var.id)
Just like the variables of a dataset, subvariables are equally iterable.
var = ds['my_var']
subvar = var['my_subvariable']
Scrunch also exposes dictionary-like iteration for subvariables.
for subvar in var:
print(subvar.name)
Subvariables support the edit method as Variables do, for example:
subvar.edit(name='new name')
Categorical variables expose their categories as a CategoryList instance, which is a subclass of collections.OrderedDict.
>>> var.categories
CategoryList([
(1, Category(name="Yes", id=1, numeric_value=None, missing=False, selected=False)),
(2, Category(name="No", id=2, numeric_value=None, missing=False, selected=False)),
(-1, Category(name="No Data", id=2, numeric_value=None, missing=True, selected=False))
])
It's possible to edit all attributes of the categories except their id using the edit method.
>>> var.categories[1].edit(name="Sí", numeric_value=1)
>>> var.categories[2].edit(numeric_value=2)
>>> var.categories
CategoryList([
(1, Category(name="Sí", id=1, numeric_value=None, missing=False, selected=False)),
(2, Category(name="No", id=2, numeric_value=2, missing=False, selected=False)),
(-1, Category(name="No Data", id=2, numeric_value=None, missing=True, selected=False))
])
Use the method order, which takes a list of int or str corresponding to either category id or name, to update their order.
>>> var.categories.order(2, 1, -1)
>>> var.categories
CategoryList([
(2, Category(name="No", id=2, numeric_value=2, missing=False, selected=False)),
(1, Category(name="Sí", id=1, numeric_value=None, missing=False, selected=False)),
(-1, Category(name="No Data", id=2, numeric_value=None, missing=True, selected=False))
])
All unmentioned categories will remain at the end of the order, in their current order.
You can also quickly see the id values for each category using list:
>>> list(var.categories)
[2, 1, -1]
You can add new categories to an existing variable:
>>> var.add_category(
... id=5,
... name="No sé",
... numeric_value=None,
... missing=False,
... before_id=-1
... )
>>> var.categories
CategoryList([
(2, Category(name="No", id=2, numeric_value=2, missing=False, selected=False)),
(1, Category(name="Sí", id=1, numeric_value=None, missing=False, selected=False)),
(5, Category(name="No sé", id=5, numeric_value=None, missing=False, selected=False)),
(-1, Category(name="No Data", id=2, numeric_value=None, missing=True, selected=False))
])
You can also delete a category if necessary. It is recommended any data corresponding to such a category be removed beforehand (for example by using replace_values):
var.categories[5].delete()
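For example, a sketch that first blanks out any rows still coded to the category and then deletes it (the alias 'my_var' and category id are hypothetical):
ds['my_var'].replace_values(value=None, filter='my_var == 5')
ds['my_var'].categories[5].delete()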
Variables can be organized into nested folders for convenience. A new Folder class is introduced that contains methods to manipulate variables and subfolders.
The folders API is accessed via:
# Obtains the root folder for variables
root_folder = dataset.folders.root
The elements inside a folder are a mix of Variable and Folder instances; they are available under the .children property.
children = root_folder.children
for child in children:
    if isinstance(child, Variable):
        print("I am variable %s" % child.name)
    elif isinstance(child, Folder):
        print("I am a subfolder %s" % child.name)
To organize the variables in folders, it is necessary to create subfolders using the make_subfolder method in Folder instances.
l1 = root_folder.create_folder("level 1")
l2 = l1.create_folder("level 2")
l3 = l2.create_folder("level 3")
It is possible to create subfolders in specific positions indicating before or after another element name (or variable alias) or by its particular location.
Additionally, it takes the alias keyword, which receives a list of aliases of the variables to be included as part of the newly created folder. They'll be moved out of their current location (see the sketch after the example below).
root = ds.folders.root
root.create_folder("p1")
root.create_folder("p2")
root.create_folder("p3")
root.create_folder("A", before="p2")
assert [c.name for c in root.children] == ["p1", "A", "p2", "p3"]
root.make_subfolder("B", after="p2")
assert [c.name for c in root.children] == ["p1", "A", "p2", "B", "p3"]
root.make_subfolder("C", position=3)
assert [c.name for c in root.children] == ["p1", "A", "C", "p2", "B", "p3"]
Subfolders are always created as the last element of the parent folder. It is possible to get a deeply nested subfolder or variable by using its | separated path.
l3 = root_folder.get("| level 1 | level 2 | level 3")
l3.path # prints "| level 1 | level 2 | level 3"
To put variables or folders inside other folders, the move_here method accepts a list of instances and appends them to the target folder.
folder_a = root_folder.make_subfolder("A")
folder_b = root_folder.make_subfolder("B")
age_var = dataset['Age']
gender_var = dataset['Gender']
folder_a.move_here(age_var, folder_b, gender_var)
folder_b.move_here(age_var)
assert [c.name for c in folder_a.children] == ["B", "Gender"]
assert [c.name for c in folder_b.children] == ["Age"]
It is also possible to move variables by referencing them by alias:
folder_a.move_here("Age", folder_b, "Gender")
Additionally, move_here accepts the before, after and position arguments to place the newly moved items in a particular location in the folder tree.
To reorder the contents of a folder, the reorder method accepts a list of the folder's children and updates the order:
children = folder_a.children
assert [c.name for c in folder_a.children] == ["B", "Gender"]
folder_a.reorder(reversed(children))
assert [c.name for c in folder_a.children] == ["Gender", "B"]
folder_a.reorder(items=children)
assert [c.name for c in folder_a.children] == ["B", "Gender"]
It is also possible to send the name strings of the contents of the folder for reordering.
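For example:
folder_a.reorder(["Gender", "B"])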
To change the name of a folder, use the rename method:
assert folder_a.name == "A"
folder_a.rename("Renamed")
assert folder_a.name == "Renamed"
assert folder_a.parent.get("Renamed").url == folder_a.url
Action | Orders API | Folders API | Details |
---|---|---|---|
Obtain the root element | root = dataset.order['|'] | root = dataset.folders['|'] | |
Fetch a sub folder/group | root['|path | group'] | root['|path | folder'] | |
Make a new group/subfolder | group.create_group() | folder.create_folder() | Both take after, before, position and alias optional arguments |
Change the name of a group | group.rename() | folder.rename() | Both take a string as the new name |
Add to the end of a group | group.append() | folder.append() | Receives either a list of strings (variable aliases) or of Folder and Variable instances that will be added to the end of the target folder. In the folders API this is a wrapper around the .move_here() method |
Add in an arbitrary position | group.insert() | folder.insert() | Another wrapper of Folder.move_here() that takes a list of variable aliases or instances of Folder and Variable, as well as a position argument, and moves the elements to the desired folder |
Move a variable to a group | variable.move(path) | variable.move_to_folder(path) | Both take the same path as the target folder argument and optional before, after and position arguments to indicate the destination position of the single variable. This is also a wrapper for Folder.move_here() |
Move many elements to a group | not available | folder.move_here(elements) | Allows moving several items to a folder in one call. It takes either a list of strings (variable aliases) or Folder and Variable instances. By default they will be appended at the end of the folder. It optionally takes the before, after and position arguments to indicate the location of the new elements |
Reorder the contents of a group | group.reorder() | folder.reorder() | The folders API takes a list of strings (variable aliases/names or folder names that are currently part of the folder) and will reorder the contents accordingly. It also accepts a list of Folder and Variable instances to reorder, the same output as folder.children |
The list of elements of a group | list(group) | folder.children | The .children property returns a list of Variable or Folder instances of the contents |
CrunchBoxes are snapshots of a dataset that can appear in a web page outside of the Crunch App. The data captured in a CrunchBox can be multidimensional in that it can show more than just one piece of data from the dataset (end users can interact with it to change the view they are looking at).
After being created, CrunchBox data becomes static regardless of changes later made to its parent dataset. For example if new rows are appended to the parent dataset or its data is edited in some way those changes will not be expressed in a CrunchBox that was created before those changes were made. In order to take advantage of changes in the parent dataset the CrunchBox would need to be re-created.
Create a new CrunchBox by using the create_crunchbox method on the dataset. create_crunchbox will return an instance of scrunch.datasets.CrunchBox; see below for further details.
>>> cbox = ds.create_crunchbox(
...     title='Short descriptor',
...     header='Header information for the CrunchBox',
...     footer='Footer information for the CrunchBox',
...     notes='Other information relevant for this CrunchBox',
...     variables='List of variable aliases or `Variable` instances (None=all)',
...     filters='List of filter names or Filter instances',
...     min_base_size=None,  # min sample size to display values in graph
...     palette=None,  # dict of colors
...     force=False
... )
The given variables will be available as either primary or secondary variables in the CrunchBox, meaning any of them can be used as either the main variable or as a comparison on the main variable. Since the CrunchBox is an interactive tool its end users will decide on-the-fly how they want to use each variable included in this list.
The given filters must be filters that already exist on the dataset - this is not a list of variable aliases.
Further information on providing a custom palette can be found at http://docs.crunch.io/.
View the CrunchBoxes available on the dataset:
>>> ds.crunchboxes
[<CrunchBox: title='Short descriptor'; id='7e7e275378d8fdb63715c9053409c50e'>]
This list can also be indexed or iterated, and each item is an instance of scrunch.datasets.CrunchBox; see below for further details.
CrunchBox instances have the following attributes and methods:
>>> cbox.widget_url # the widget url
'https://s.crunch.io/widget/index.html#/ds/7e7e275378d8fdb63715c9053409c50e/'
>>> cbox.url # the api url
'https://app.crunch.io/api/datasets/4f2ebd029ff2414191537eff6037099f/boxdata/7e7e275378d8fdb63715c9053409c50e/'
>>> cbox.iframe() # display the full iframe html code
'<figure style="text-align:left;" class="content-list-component image"> <div style="padding-bottom: 12px"> <span style="font-size: 18px; color: #444444; line-height: 1;">Human friendly identifier</span> </div></figure><iframe src="https://s.crunch.io/widget/index.html#/ds/7e7e275378d8fdb63715c9053409c50e/" width="600" height="480" style="border: 1px solid #d3d3d3;"></iframe>'
>>> cbox.remove() # permanently delete the CrunchBox
Exporting datasets is as easy as:
CSV export (the default):
ds.export(path='/Downloads/ds.csv')
SPSS export:
ds.export(path='/Downloads/ds.sav', format='spss')
The export can also be refined in a number of ways. filter provides the ability to export a subset of rows and variables allows a subselection of variables. hidden=True allows you to export variables that are currently hidden (variable attribute discarded=True). timeout allows you to specify the number of seconds that should be allowed for the export to complete. If the export doesn't complete in that time a TaskProgressTimeoutError will be raised. The default timeout is None, which means the export will never time out.
ds.export(
    path='/Downloads/ds.csv',
    filter='var1 == 4',
    variables=['var1', 'var2'],
    hidden=True,
    timeout=60
)
filter may be a logical expression or an actual dataset Filter referenced as per ds.filters["myfilter"].
variables is a list of variable aliases, ids or Variable objects.
When variables is given hidden is ignored.
timeout is either None (the default) or an int representing the number of seconds before timeout.
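For example, exporting with a saved dataset filter rather than an expression (assuming a filter named 'myfilter' was added earlier with add_filter):
ds.export(path='/Downloads/ds.csv', filter=ds.filters['myfilter'])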
Additionally, an options argument in the export method accepts a dict for controlling the same export options seen in the Crunch UI. The default options values for each export format are:
CSV export default options:
options = {
'use_category_ids': True
}
SPSS export default options:
options = {
'prefix_subvariables': False,
'var_label_field': 'description'
}
Example:
ds.export(
path='/Downloads/ds.sav',
format='spss',
options={
'prefix_subvariables': True
}
)