User Reference
Scrunch is a Pythonic scripting library for cleaning data in Crunch. The following document describes the basic usage of scrunch.
Let’s start with a simple session and getting a Crunch dataset.
Use the connect method to authenticate with the Crunch server.
from scrunch import connect, get_dataset
connection = connect(api_key="<your api key>", site_url="https://your-domain.crunch.io/api/")
Or, if you don't have an API Key:
connection = connect(user="[email protected]", pw="yourpassword", site_url="https://your-domain.crunch.io/api/")
The library will automatically detect the provided credentials and will create a persistent session from this point on.
If an explicit call to connect() hasn't been made, the first call to get_dataset() or get_project() will establish a default connection to the default Crunch host.
The default connection will first look at the environment variable CRUNCH_API_KEY or a crunch.ini file in the local path with the following format:
[DEFAULT]
CRUNCH_API_KEY=apikeysecret
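With that in place you can skip the explicit connect() call entirely; a minimal sketch of relying on the default connection:
from scrunch import get_dataset

# No explicit connect(): scrunch picks up CRUNCH_API_KEY (or crunch.ini)
# and establishes a default connection to the default Crunch host.
ds = get_dataset("Dataset X")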
When we get a dataset the existing connection is used (see above).
Getting a dataset is as simple as:
ds = get_dataset("Dataset X")
Sometimes datasets will belong to a project. In this case we need to pass the project parameter with the name or the id of the project to the get_dataset function.
ds = get_dataset("Dataset X", project="Project Y")
When we get a project the existing connection is used (see above).
Getting a project is as simple as:
pr = get_project("Project X")
You can create a nested project by using create_project:
pr.create_project("Project Y")
You can rename an existing project using rename:
pr.rename("Project Z")
You can delete an empty project by using delete:
pr.delete()
If you know the name or id of a nested project you can get it in the following way:
pr_b = pr_a.get('Project B')
You can iterate over the contents of a project by using children. Items here will be either datasets or projects (which can themselves be iterated over in turn):
for child in pr.children:
    if isinstance(child, Dataset):
        print("I am dataset: %s" % child.name)
    elif isinstance(child, Project):
        print("I am a project: %s" % child.name)
To add datasets or projects to a project use move_here. By default the given items will be appended at the end. Optionally, a specific position, or a relative before or after position, can be given:
pr.move_here(items=[ds])
pr.move_here(items=[ds1, ds2], position=0)
pr.move_here(items=[ds1, pr1], before='Dataset X')
pr.move_here(items=[ds1, ds2], after='Dataset Y')
Note that items must always be given as a list even if there is only one item.
To move a dataset into your Personal project, follow a similar approach to what is described above except use get_personal_project to initially get the project. With this approach the dataset will be removed from its current project and anyone with whom the dataset has been shared directly will be able to find it in their Personal Project:
pr = scrunch.get_personal_project()
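Then move the dataset into it using move_here, exactly as described above:
pr.move_here(items=[ds])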
To see what members have access to the given project you need to access the Members instance on each project using:
members = pr.members
To see the email addresses of all members attached to the project use list:
>>> pr.members.list
[u'[email protected]',
u'[email protected]']
Add or remove members of the project in the following way:
pr.members.add('[email protected]')
pr.members.remove('[email protected]')
To change the permissions of a member:
pr.members.edit('[email protected]', {'team_admin': True})
pr.members.edit('[email protected]', {'edit': True})
Teams are a convenient way to manage groups of users as a single unit.
You can share datasets with teams as you can with a single user. You can also add a team as a member of a project.
Create a new team (you will be the only team member and you will be set as the team's admin):
hogwarts = scrunch.create_team("Hogwarts")
You can get an existing team in the following way:
hogwarts = scrunch.get_team("Hogwarts")
You can access and edit the members of a team in the following way. If you add a member using edit=True that team member will also become a team admin.
>>> members = hogwarts.members
>>> members.list()
[u'[email protected]']
>>> hogwarts.members.add('[email protected]', edit=True)
>>> hogwarts.members.add('[email protected]')
>>> hogwarts.members.add('[email protected]')
>>> hogwarts.members.add('[email protected]')
>>> hogwarts.members.list()
[u'[email protected]',
u'[email protected]',
u'[email protected]',
u'[email protected]',
u'[email protected]']
>>> hogwarts.members.remove('[email protected]')
>>> hogwarts.members.list()
[u'[email protected]',
u'[email protected]',
u'[email protected]',
u'[email protected]']
>>> hogwarts.members.edit('[email protected]', permissions={'team_admin': True})
One crucial step when manipulating a dataset is that you need to be the dataset's current editor. There are two ways to achieve this.
Either take editorship when you get the dataset:
ds = get_dataset("Example Dataset", editor=True)
Or set yourself as the editor by calling:
ds.change_editor(user='[email protected]')
Becoming the editor will allow you to edit the dataset name and description:
ds.edit(name='Brand Tracker', description='Waves 1-50', notes='Results current as of 24 Nov 2016')
You can also add new users (with editor or viewer privileges), edit the privileges of an existing user, or remove an existing user's access to the dataset completely:
# add a new user with editor privileges
ds.add_user(email, edit=True)
# add multiple new users with a list
ds.add_user([email_1, email_2], edit=True)
# change an existing user's privileges to viewer
ds.edit_user(email, edit=False)
# remove an existing user
ds.remove_user(email)
You can also check who the current editor is:
>>> ds.editor
"<User: email='[email protected]'; id='892397ace5154b41b99b2dc04bd3ffc4'>"
Information on current owner is available in a similar way:
>>> ds.owner
"<User: email='[email protected]'; id='f0df1fcb2d1e406d8361e6403d7b645d'>"
The owner of a dataset can be either a user or a project. In both cases the change_owner method is what you need to pass ownership between entities:
ds.change_owner(user='[email protected]')
ds.change_owner(project='Coca-Cola')
The settings of the dataset can be accessed via the settings property:
>>> ds.settings
{
    'viewers_can_export': False,
    'min_base_size': 0,
    'weight': None,
    'viewers_can_change_weight': False
}
The dataset.settings property is a read-only dictionary. Modification of settings is done through the Dataset.change_settings method, which currently takes the viewers_can_export and viewers_can_change_weight parameters in order to change the values of those dataset settings:
ds.change_settings(viewers_can_export=True)
ds.change_settings(viewers_can_export=False, viewers_can_change_weight=True)
The rest of the settings are currently read-only. Trying to change invalid or read-only settings raises a ValueError exception.
The list of known weight variables can be accessed with the following code:
>>> ds.weights
[u'weight', u'Weight-age-nets']
Savepoints work as snapshots of the current state of the dataset. You can return to a savepoint later to undo whatever has been done to the dataset in the meantime.
You can create savepoints with the following:
ds.create_savepoint('savepoint description')
To revert changes and return to the previous savepoint do the following:
ds.load_savepoint('savepoint description')
When loading a savepoint all savepoints after the one that is being loaded will be lost permanently.
While editing the dataset you may want to work on a fork, which is a copy of the dataset that can be merged back onto the original later. The parent dataset keeps a record of any forks that may have been created from it.
Starting with the instance of a dataset, use the fork() method to create a fork.
ds_fork = ds.fork()
The resulting object is also a scrunch.datasets.Dataset instance and you can use it as you would any other dataset instance. The fork can be given a name and a description if desired.
ds_fork = ds.fork(name="Fork #1", description="Working version")
If name is not given then one will be generated automatically from ds.name in the form "FORK #{} of {}".format(nforks + 1, ds.name), where nforks is the current number of forks attached to ds. It is usually better to allow the fork name to be generated in this way. Provide a less technical description if needed.
The owner of the fork is the user that created it unless the preserve_owner parameter is set to True, in which case the fork will be owned by the same User as its parent dataset:
ds_fork = ds.fork(preserve_owner=True)
If the owner of the parent dataset is a Crunch Project (instead of a regular Crunch User), then the ownership will be preserved and the fork will be part of the same Project as its parent dataset, regardless of the preserve_owner parameter.
Just as you can delete a dataset, a fork can be deleted by calling the .delete() method:
ds_fork.delete()
It is also possible to delete all forks at once using the .delete_forks() method on the dataset. Note that here we are using the ds object, not the ds_fork.
ds.delete_forks()
It's also possible to merge a fork back into its parent dataset using the merge method. This can be done by providing either the fork id or its name, or, where the fork name has been auto-generated (as described above), the fork number given as either an int or a str.
ds.merge(ds_fork.id)
ds.merge("Fork #1")
ds.merge(1)
ds.merge("1")
Sometimes when a fork is merged back into its parent dataset conflicts between the two are exposed. autorollback provides some control in this situation. This is an optional argument that is True by default.
ds.merge(1, autorollback=False)
If True the parent dataset will be rolled back to the previous savepoint in case of a merge conflict. If False the parent dataset and fork will be left as-is.
Logical expressions can be used in any place where a description or query against the dataset is needed. They can be composed using the following logical operators:
operator | meaning |
---|---|
== | equal |
!= | unequal |
> | greater than |
>= | greater or equal |
< | less than |
<= | less or equal |
and | logical and |
or | logical or |
in | in list/tuple |
any | any(list/tuple) |
all | all(list/tuple) |
not | not in/any/all |
r | r(lower, upper) |
The r() helper function declares a range of integers but it must be wrapped in brackets []. The expression [r(1,3), 6, 7] automatically translates to [1, 2, 3, 6, 7].
For example:
"age in [r(16,24)]"
"q1.any([r(1,1000)]"
Compound expressions are also possible, such as:
"disposition != 0 or exit_status != 1"
"(age in [r(35, 54)] and gender == 1) or region == 'London'"
Date expressions are supported as long as they are ISO8601 compliant. That means that you may also provide only a partial representation of a complete ISO8601 datetime string as in the following examples:
ds.exclude('starttime < "2015-01-01T00:00:00+00:00"') # complete iso8601 string
ds.exclude('starttime < "2015-01-01T12:30"') # day, hour, minute
ds.exclude('starttime < "2015-01-01T12"') # day, hour
ds.exclude('starttime < "2015-01-01")') # day only
It's possible to permanently delete rows in the dataset using drop_rows. This cannot be undone except by reverting to a previous savepoint.
To permanently delete rows use the following:
ds.drop_rows("disposition != 1")
Exclusion filters are used to hide invalid data. In the context of a survey this could be respondents who spent too little time answering it ("speeders"), rows with inconsistent data, etc. Exclusion filters are specified using a logical expression.
For example, assume that we have a variable with the alias "disposition" in the dataset. Then apply the exclusion filter:
ds.exclude("disposition != 'complete'")
In the above example disposition is a categorical variable and "complete" is the name of one of its categories. An equivalent expression could reference the id (or code) assigned to that category instead:
ds.exclude("disposition != 1")
It's possible to filter using two variables. This will compare the categorical names in the two variables:
ds.exclude("income == household_income")
We can also exclude a list referencing either id or name and using either brackets [] or parentheses ():
ds.exclude("disposition in [0, 2]")
ds.exclude("disposition in ('incomplete', 'screenout')")
If you want to exclude all missing or valid values for a given variable you can achieve that with the following examples:
ds.exclude("missing(income)")
ds.exclude("valid(income)")
You can also check the exclusion filter on a dataset at any time with the following:
>>> ds.exclude('disposition != 1')
>>> ds.get_exclusion()
"disposition != 1"
This will print the exclusion filter, but for a complicated filter the parentheses may differ between what was provided and what get_exclusion returns. Rest assured the meaning of the exclusion filter is not affected by this difference.
This method can also be used to extend your filters in a convenient way:
>>> ds.exclude('disposition != 1')
>>> ds.exclude(ds.get_exclusion() + ' and exit_status >= 1')
>>> ds.get_exclusion()
"disposition != 1 and exit_status >= 1"
Datasets allow viewers and editors to add custom filters which can be reused on many features of the dataset. A filter is defined by a logical expression and is attached to the dataset.
In order to add a filter to a dataset:
ds.add_filter(expr='age > 21 and race == 3', name='myfilter')
The filter can be made public, meaning that all users of the dataset will be able to see and apply it if desired.
ds.add_filter(expr='age > 21 and race == 3', name='myfilter', public=True)
Navigating a dataset's existing filters is done via the filters property.
ds.filters['myfilter']
In this way they are also editable:
ds.filters['myfilter'].edit(
    name='18+',
    expression='age > 18 and race == 2',
    public=False
)
It's possible to append cases from one dataset onto another. It's also possible to join variables from one dataset onto another by matching a unique key in each. However, both of these operations rely on the datasets involved sharing the same metadata in order to avoid conflicts. To discover such conflicts ahead of time it's useful to compare them first.
Let's use the following datasets in our examples:
ds1 = get_dataset('Dataset X')
ds2 = get_dataset('Dataset Y')
Compare the metadata between the two datasets first with the following:
ds1.compare_dataset(ds2)
This will return details of any discovered conflicts that may cause problems in an append_dataset or join operation. If any such conflicts are found you will need to use what is returned by compare_dataset to manually resolve them before continuing on to join or append the datasets.
Use the append_dataset method to append rows from one dataset onto another.
ds1.append_dataset(
    dataset,
    filter="Wave in [3, 4, 5]",
    variables=None,
    autorollback=True
)
dataset is the dataset instance to append from.
filter is a logical expression for filtering rows to be appended from the incoming dataset.
variables is a list of variable names to append data onto from the incoming dataset.
autorollback: if True and the operation fails, the dataset will be rolled back to its pre-append state automatically.
These last 3 arguments are optional.
Use the join method to join variables from one dataset onto another by matching on a unique key.
ds1.join(
    left_var='uid',
    right_ds=ds2,
    right_var='uid',
    columns=['languagepreference'],
    filter="languagepreference == 3",
    wait=True
)
columns is a list of variables we want to join onto the dataset.
filter is a logical expression for filtering rows in the join operation.
wait indicates whether the method should wait for Crunch to finish the join or instead return a URL with the progress indicator.
These last 3 arguments are optional.
Variables are available as members of the dataset object, like a dictionary. Like datasets they expose some attributes that can be changed and others that can only be read.
Editable attributes are: name, description, view, notes, format and uniform_basis.
The immutable attributes are: id, type, categories, discarded (see below) and url.
Special attributes: alias is editable but only for non-streaming datasets.
uniform_basis is a special attribute for multiple_response variables. If uniform_basis=True (the default) the percentage for each subvariable's selected categories will be calculated against the cumulative base for all of the subvariables. If uniform_basis=False the percentages for each subvariable will instead be calculated independently (a so-called floating base).
To edit them the Variable class provides the edit method:
ds['var'].edit(name='a short name', description='a different description')
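Since uniform_basis is listed among the editable attributes, toggling it on a multiple_response variable might look like this (a sketch; 'my_array' is a hypothetical alias):
ds['my_array'].edit(uniform_basis=False)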
The discarded attribute is actually edited using the dedicated helper methods hide and unhide:
ds['var'].hide()
ds['var'].unhide()
categorical variables can be decorated with headings and subtotals, which are special elements that can appear on top of, within or below the variable's categories.
A heading is simply a piece of text. Add a heading with the following.
var = var.add_heading(name='Coke products', categories=1, anchor="top")
var = var.add_heading(name='Sprite products', categories=5, anchor="top")
As you can see in the above example, add_heading() returns the edited version of the variable. This is very important because if you need to continue working with the variable, including by adding additional headings or subtotals, you must at each step use the most recently returned version of the variable. Otherwise, you will only end up with the last heading or subtotal you added.
name is the text that will be inserted at the indicated position.
categories is where you indicate which category is the reference point for the anchor.
anchor is where you indicate if the heading should appear at the "top" or "bottom" of the named category.
A subtotal is the auto-aggregation of a number of categories. Add a subtotal with the following.
var = var.add_subtotal(name='Coke products', categories=[1,2,3,4], anchor="top")
var = var.add_subtotal(name='Sprite products', categories=[5,6,7], anchor="top")
As you can see in the above example, add_subtotal() returns the edited version of the variable. This is very important because if you need to continue working with the variable, including by adding additional headings or subtotals, you must at each step use the most recently returned version of the variable. Otherwise, you will only end up with the last heading or subtotal you added.
name is the text that will be inserted at the indicated position.
categories is where you indicate which categories should be aggregated together into a new category-like element on the variable.
anchor is where you indicate if the subtotal should appear at the "top" or "bottom" of the named categories.
You can see what transformations exist on a variable (that has been returned by either add_heading or add_subtotal!):
>>> var.transformations()
[
    {
        u'function': u'subtotal',
        u'args': [1, 2, 3, 4],
        u'anchor': 'Top',
        u'name': u'Coke products'
    },
    {
        u'function': u'subtotal',
        u'args': [5, 6, 7],
        u'anchor': 'Top',
        u'name': u'Sprite products'
    }
]
To clear all heading/subtotal transformations on a variable use either of the adding methods and pass categories=None, for example:
>>> var.add_heading(name="", categories=None)
>>> var.transformations()
[]
This dataset method adds a new material (non-derived) variable by uploading a list of values that will be used to populate it immediately.
For example:
new_var = ds.create_variable(
    var_type='categorical',
    alias='football_clubs',
    name='Football clubs',
    description='Which football club do you follow?',
    categories=[
        {'name': 'ManU', 'id': 1, 'numeric_value': 1, 'missing': False},
        {'name': 'Chelsea', 'id': 2, 'numeric_value': 2, 'missing': False},
        {'name': 'Tottenham', 'id': 3, 'numeric_value': 3, 'missing': False}
    ],
    values=[2, 1, 3, 1, 2, 2, 3]
)
Supported var_types: text, numeric, categorical, datetime, multiple_response or categorical_array.
alias, name, description and notes will be inherited by the new variable directly.
When var_type is categorical or categorical_array, categories must be given. categories is a list in the form given in the above example.
When var_type is multiple_response the categories given must indicate a boolean value for selected as per the below example. This is what differentiates a multiple_response from a categorical_array.
When var_type is multiple_response or categorical_array, subvariables must be given. subvariables is a list in the form:
new_var = ds.create_variable(
var_type='multiple_response',
alias='football_clubs',
name='Football clubs',
description='Which football clubs are you aware of?',
categories=[
{'name': 'Not selected', 'id': 2, 'missing': False, "selected": False},
{'name': 'Selected', 'id': 1, 'missing': False, "selected": True}
],
subvariables=[
{'name': 'ManU', 'alias': 'ManU', 'id': 1},
{'name': 'Chelsea', 'alias': 'Chelsea', 'id': 2}
],
values=zip(values_ManU, values_Chelsea)
)
Note that the shape of the values given to a new multiple_response or categorical_array is a list of lists. More specifically, it's a list of row values rather than a list of column values. Use the zip function, as demonstrated in the above example, to reshape column-values to row-values if needed.
It's possible to move an existing group of categorical variables into a new categorical_array using move_to_categorical_array. This will result in the creation of a new non-derived/material categorical_array. If you need to create a new derived categorical_array you will need to use bind_categorical_array instead.
ds.move_to_categorical_array(
    alias="my_array",
    name="My array",
    description="This is my array",
    notes="",
    subvariables=[cat1_alias, cat2_alias, cat3_alias]
)
This method is the equivalent of move_to_categorical_array except that it uses an existing group of categorical variables to create a new non-derived/material multiple_response.
ds.move_to_multiple_response(
    alias="my_multi",
    name="My multi",
    description="This is my multi",
    notes="",
    subvariables=[cat1_alias, cat2_alias, cat3_alias]
)
Add an existing categorical variable to an existing categorical_array or multiple_response using move_as_subvariable:
ds.move_as_subvariable(
    destination=alias_array,
    source=alias_categorical
)
It's possible to reorder the subvariables in an array using reorder_subvariables. This can be used to reorder the subvariables in a categorical_array or a multiple_response and works just as well on a derived variable as it does on a non-derived/material variable:
ds.reorder_subvariables(
    subvariables=[subalias2, subalias1, subalias3]
)
It's possible to "unbind" all (and only all) of the subvariables in a non-derived/material categorical_array
using unbind
.
my_array = ds[alias]
my_array.unbind()
This will result in all of the subvariables in the targeted array becoming independent categorical variables on the dataset.
The values for specific variables can be changed explicitly (i.e. without using a filter expression or derivation to generate values independently for each row) using replace_values. Using this method a variable is named and a value is given, which will then overwrite any data already in that variable.
Optionally, a filter expression can be provided, in which case the data will only be applied to rows satisfying that expression.
Multiple variables and values can be defined at the same time as long as the edits all conform to the same filter (if any).
In the following example any data in the variable weight, where the variable segment is either 2 or 5, will be overwritten with the value 1.
ds['weight'].replace_values(value=1, filter='segment in [2, 5]')
To remove data from a variable pass None to it, as in the next example where any/all values that may have been in the weight variable are being overwritten with nothing:
ds['weight'].replace_values(value=None)
To replace subvariable values use the same approach directly on the subvariable:
ds[alias][subalias].replace_values(value=None)
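As noted above, several variables can be edited in a single call when they share the same filter (or no filter). A minimal sketch, assuming the dataset itself also exposes a replace_values method that accepts a dict of {alias: value} (the aliases and filter here are illustrative):
ds.replace_values({'weight': 1, 'segment': 5}, filter='disposition == 1')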
Scrunch provides a number of ways to add derived variables to a dataset.
See the NOTE and the safely_integrate_weight example below before using this method.
Add a derived weight to a dataset using derive_weight.
Provide the required targets as per the below example, a list in the form [{alias: targets}, ...]. Each list of targets must sum exactly to 1.
ds.derive_weight(
targets=[
{'gender' : {
1: .45,
2: .55}},
{'locality': {
1: .10,
2: .15,
3: .20,
4: .25,
5: .30}}
],
alias='weight_natrep',
name='NatRep weight'
)
This will add a new derived variable with the alias weight_natrep to the dataset. As a derived variable its result will change with changes in the dataset (i.e. more rows or changed data in dependent variables). If you need to edit this variable in any way that is disallowed for derived variables, or need the result to remain unchanged despite any other changes to the dataset going forward, you must integrate it immediately afterwards.
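For example, to lock in the derived values right away:
ds['weight_natrep'].integrate()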
NOTE: Be very careful using derive_weight with exclusions active on the dataset! If any exclusions are active derive_weight WILL populate weight factors for excluded cases. If those excluded cases are included again later the weighted results WILL BE WRONG. One way to overcome this complication is to use a function that will ensure your weights are consistently handled in the same way:
def safely_integrate_weight(ds, weight_filter=None, *args, **kwargs):
    """
    Safely derive, then integrate a weight on ds by guarding any exclusions.

    This function will:
    1. Record the current exclusion on ds for later
    2. Remove all exclusions on ds
    3. (optional) Apply negated weight filter as exclusion on ds
    4. Derive the weight using the given *args and **kwargs
    5. Integrate the derived weight variable
    6. Replace values in negated weight filter rows for the weight
       variable with No Data
    7. Re-apply the original exclusion recorded earlier

    Parameters
    ----------
    ds : scrunch.datasets.Dataset
        The targeted Scrunch dataset.
    weight_filter : str, default=None
        If given, a filter describing which rows in the dataset the weight
        should be generated for. Other rows will end up with No Data in
        the new weight variable.
    *args : other arguments
        These will be passed to ds.derive_weight().
    **kwargs : other keyword arguments
        These will be passed to ds.derive_weight().

    Returns
    -------
    var : scrunch.Variable
        The integrated weight variable.
    """
    original_exclusion = ds.get_exclusion()
    ds.exclude()
    if weight_filter is not None:
        not_weight_filter = 'not ({})'.format(weight_filter)
        ds.exclude(not_weight_filter)
    var = ds.derive_weight(*args, **kwargs)
    var.integrate()
    if weight_filter is not None:
        var.replace_values(value=None, filter=not_weight_filter)
    ds.exclude(original_exclusion)
    return var
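Usage mirrors derive_weight itself; for example, weighting only completed interviews (the filter and targets here are illustrative):
weight = safely_integrate_weight(
    ds,
    weight_filter='disposition == 1',
    targets=[{'gender': {1: .45, 2: .55}}],
    alias='weight_natrep',
    name='NatRep weight'
)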
This dataset method adds a new derived variable that can be either single-response (categorical) or multiple-response (multiple_response). When used to create a multiple_response this method acts as a convenient wrapper around the more explicit derive_multiple_response, which should be used if more control than this method allows for is required.
alias, name, description and notes will be inherited by the new variable directly.
multiple controls whether or not the new variable should be created as a multiple_response. If False (the default) it will be created as a simple categorical (single-response). If your case statements are not mutually exclusive you should make sure multiple=True in order to capture all of your case statements correctly.
When a single-response (categorical) variable is needed multiple should be omitted or set to false (multiple=False).
categories is a list of dictionaries with the required keys id, name and case.
case is the logical expression describing which respondents should be recoded into each id of the new variable. The name for each category will be shown in the UI.
numeric_value may also be given to a category to set factors for each category if they need to differ from the id (for example in the calculation of a mean).
new_var = ds.create_categorical(
alias='genage5',
name='Gender/Age (5 groups)',
description='Gender by 18-24/25-34/35-44/45-54/55+',
notes='All UK adults',
categories=[
{'id': 1, 'name': 'Men 18-24', 'case': 'gender == 1 and agerange == 1'},
{'id': 2, 'name': 'Men 25-34', 'case': 'gender == 1 and agerange == 2'},
{'id': 3, 'name': 'Men 35-44', 'case': 'gender == 1 and agerange == 3'},
{'id': 4, 'name': 'Men 45-54', 'case': 'gender == 1 and agerange == 4'},
{'id': 5, 'name': 'Men 55+', 'case': 'gender == 1 and agerange == 5'},
{'id': 6, 'name': 'Women 18-24', 'case': 'gender == 2 and agerange == 1'},
{'id': 7, 'name': 'Women 25-34', 'case': 'gender == 2 and agerange == 2'},
{'id': 8, 'name': 'Women 35-44', 'case': 'gender == 2 and agerange == 3'},
{'id': 9, 'name': 'Women 45-54', 'case': 'gender == 2 and agerange == 4'},
{'id': 10, 'name': 'Women 55+', 'case': 'gender == 2 and agerange == 5'}]
)
When a multiple-response (multiple_response) variable is needed multiple must be set to true (multiple=True).
You will also have the option to set uniform_basis to True or False. If uniform_basis=True (the default) the percentage for each subvariable's selected categories will be calculated against the cumulative base for all of the subvariables. If uniform_basis=False the percentages for each subvariable will instead be calculated independently (a so-called floating base).
multiple_response variables are arrays and the categories given will be interpreted as instructions for the array's subvariables. The new multiple_response array will be given the following default set of categories:
id | name | selected | missing |
---|---|---|---|
1 | Selected | True | False |
2 | Not Selected | False | False |
-1 | No Data | False | True |
In this case categories is a list of dictionaries with the required keys id, name and case.
id will be combined with the alias of the new variable to uniquely name the multiple_response array's subvariables. For example when creating a new multiple_response with the alias drinks, a "category" with the 'id': 1 will result in a subvariable with the alias drinks_1.
case indicates the condition required for that "category" (subvariable) to be considered selected.
Optionally, missing_case can be given and indicates the condition for that "category" (subvariable) to be considered missing.
Rows in the data for each "category" (subvariable), respectively, will be considered to be not selected if they satisfy neither case nor missing_case, or if they do not satisfy case and missing_case has not been given.
NOTE: Remember that not selected is still considered to be a valid response and is not missing, even though such cases will not be counted as selected for that "category" (subvariable).
missing_case can be given EITHER once to the method as a whole OR individually for each "category" (subvariable).
If missing_case is given to the method itself it will be applied automatically as the missing_case for each "category" (subvariable), as in the following example:
new_var = ds.create_categorical(
alias='drinks',
name='Preferred drinks',
description='Which drinks do you prefer?',
notes='',
multiple=True,
missing_case='missing(q1)',
uniform_basis=False,
categories=[
{'id': 1, 'name': 'Coke products', 'case': 'q1 in [1,2,3,4,5]'},
{'id': 2, 'name': 'Pepsi products', 'case': 'q1 in [6,7,8,9,10]'},
{'id': 3, 'name': 'No sugar products', 'case': 'q1 in [2,3,7,8]'},
{'id': 4, 'name': 'Diet products', 'case': 'q1 in [2,7]'},
{'id': 5, 'name': 'Diet alternative products', 'case': 'q1 in [3,8]'},
{'id': 6, 'name': 'Other', 'case': 'q1 in [11,12,13,14,15]'}
]
)
Alternatively, missing_case can be given explicitly to all, some or none of the "categories" (subvariables):
new_var = ds.create_categorical(
alias='drinks',
name='Preferred drinks',
description='Which drinks do you prefer?',
notes='',
multiple=True,
categories=[
{'id': 1, 'name': 'Coke products', 'case': 'q1 in [1,2,3,4,5]', 'missing_case': 'missing(q1)'},
{'id': 2, 'name': 'Pepsi products', 'case': 'q1 in [6,7,8,9,10]', 'missing_case': 'missing(q1)'},
{'id': 3, 'name': 'No sugar products', 'case': 'q1 in [2,3,7,8]', 'missing_case': 'missing(q1)'},
{'id': 4, 'name': 'Diet products', 'case': 'q1 in [2,7]', 'missing_case': 'missing(q1)'},
{'id': 5, 'name': 'Diet alternative products', 'case': 'q1 in [3,8]', 'missing_case': 'missing(q1)'},
{'id': 6, 'name': 'Other', 'case': 'q1 in [11,12,13,14,15]', 'missing_case': 'missing(q1)'}
]
)
This method creates a new single categorical derived variable, filling its values from other categorical variables.
It takes a variables argument that indicates the cases and variables to fill from. Similar to the create_categorical responses object, this consists of a list of dictionaries, each with a case key containing a filter expression and a variable key indicating which column to use to fill.
An additional else case is allowed, in order to fill up all other rows that don't satisfy any of the indicated cases. It is possible to fill else-rows with either a variable or a default category.
The following example creates a new variable "Pop drinking frequency" based on the conditions of the pop_pref variable. For all cases where it equals 1 (prefers Coke), use the coke_freq responses, and similarly for Pepsi. Everything else defaults to other_pop_freq.
dataset.create_fill_values([
    {"case": "pop_pref == 1", "variable": "coke_freq"},
    {"case": "pop_pref == 2", "variable": "pepsi_freq"},
    {"case": "else", "variable": "other_pop_freq"}
], alias="pop_freq", name="Pop drinking frequency")
Optionally, it is possible to declare a missing category, in this case with code 99, that will map to "Not pop drinker".
dataset.create_fill_values([
    {"case": "pop_pref == 1", "variable": "coke_freq"},
    {"case": "pop_pref == 2", "variable": "pepsi_freq"},
    {"case": "else", "name": "Not pop drinker", "id": 99, "missing": True}
], alias="pop_freq", name="Pop drinking frequency")
This method adds a new derived multiple_response to the dataset with full control over the selected/not selected/missing categories for each subvariable.
In the following example the categories of the new multiple_response are given explicitly. This allows full control over the selected and missing attributes of each. For each category 'selected': False and 'missing': False will be assumed and should be overridden explicitly (as in the following, where 1 is selected and -1 is missing).
The variable's subvariables must also be defined, including for each the case expressions for each category. These subvariables are what will be displayed as if they were categories in the UI (in this case "Coke", "Pepsi", "Other", "None" and "Coke or Pepsi").
alias, name, description and notes will be inherited by the new variable directly.
new_var = ds.derive_multiple_response(
name='Preferred drinks',
alias='drinks',
description='Which drinks do you prefer?',
notes='',
categories=[
{'id': 1, 'name': 'Selected', 'selected': True},
{'id': 2, 'name': 'Not Selected'},
{'id': -1, 'name': 'No Data', 'missing': True}
],
subvariables=[
{
'id': 1,
'name': 'Coke',
'cases': {
1: 'q1 in [1]',
2: 'not q1 in [1] and valid(q1)',
-1: 'missing(q1)',
}
},
{
'id': 2,
'name': 'Pepsi',
'cases': {
1: 'q1 in [2]',
2: 'not q1 in [2] and valid(q1)',
-1: 'missing(q1)',
}
},
{
'id': 3,
'name': 'Other',
'cases': {
1: 'q1 in [95]',
2: 'not q1 in [95] and valid(q1)',
-1: 'missing(q1)',
}
},
{
'id': 4,
'name': 'None',
'cases': {
1: 'q1 in [99]',
2: 'not q1 in [99] and valid(q1)',
-1: 'missing(q1)',
}
},
{
'id': 5,
'name': 'Coke or Pepsi',
'cases': {
1: 'q1 in [1, 2]',
2: 'not q1 in [1, 2] and valid(q1)',
-1: 'missing(q1)',
}
}
]
)
This method combines a number of categories on a given variable into a new derived variable (single or multiple response). A map argument here takes the place of the case statement from the create_categorical example because the conditions for the new variable may only rely on the parent variable from which it is being created.
For example:
comb = ds.combine_categories(
ds['genage5'],
alias='Age5way',
name='5-way age',
description='18-24/25-34/35-44/45-54/55+',
categories={
1: '18-24',
2: '25-34',
3: '35-44',
4: '45-54',
5: '55+',
9: 'No answer'
},
map={
1: [1, 6],
2: [2, 7],
3: [3, 8],
4: [4, 9],
5: [5, 10],
},
default=9,
missing=[-1, 9]
)
alias, name and description will be inherited by the new variable directly.
categories is a dictionary of items in the form {id: name} that will be used to generate the new variable's metadata.
map is a dictionary equivalent to a shorthand case statement, where the new variable's id is keyed to one or more of the parent variable's ids.
default is where you can give the value that should be used if none of the items in the map apply to a given row of data.
missing indicates which categories, if any, should be set to missing for aggregation purposes.
A new variable created using combine_categories inherits the type of its parent, either categorical or multiple_response. This is because the same categories from the parent cannot be referenced more than once in the map, which means that defining overlapping combinations of responses in your map is not allowed.
It's possible to "bind" a group of categorical
variables into a new categorical_array
using bind_categorical_array
. This will result in the creation of a new derived categorical_array
. You should integrate
the result if you later need to make non-derived edits to it, or alternatively use the method move_to_categorical_array
for creating a new non-derived/material categorical_array
.
ds.bind_categorical_array(
    alias="my_array",
    name="My array",
    description="This is my array",
    notes="",
    subvariables=[
        {"id": 1, "alias": cat1_alias},
        {"id": 2, "alias": cat2_alias},
        {"id": 3, "alias": cat3_alias}
    ]
)
A new numeric variable can be created using a derivation expression.
ds.create_numeric(
    alias='monthly_rent',
    name='Monthly rent',
    description='Rent paid per month',
    notes='All UK adults',
    derivation='(weekly_rent * 52) / 12'
)
It's possible to set the missing rules for a numeric, text or date variable by using its missing_rules attribute, which behaves like a dict.
Additionally, the variable has a set_missing_rules method which can be used to set multiple rules at once (or none) by passing a dict. Note that using set_missing_rules will fully replace any rules that already sit on the variable.
Remember that missing rules for categorical variables are handled directly by their categories.
>>> var.missing_rules
{}
>>> var.missing_rules['invalid'] = -1
... var.missing_rules
{u'invalid': -1}
>>> var.missing_rules['NA'] = -2
... var.missing_rules
{u'invalid': -1, u'NA': -2}
>>> var.missing_rules['NA'] = -9
... var.missing_rules
{u'invalid': -1, u'NA': -9}
>>> del var.missing_rules['NA']
... var.missing_rules
{u'invalid': -1}
>>> rules = {'none': -8, 'unknown': -9}
... var.set_missing_rules(rules)
... var.missing_rules
{u'none': -8, u'unknown': -9}
>>> var.missing_rules.clear()
>>> var.missing_rules
{}
This method adds a copy of an existing variable to the dataset. In the following example a copy of genage5 is created:
new_var = ds.copy_variable(
    ds['genage5'],
    alias='genage5_copy',
    name='Copy of Gender/Age (5 groups)',
    derived=True
)
A copied variable cannot be integrated after it exists (using var.integrate()). For that reason you need to decide at the time of its creation if it should be derived or not. However you can also choose to accept the derived status of the variable you're copying, whatever that is, by using derived=None. In that case your copy will be derived only if the source variable was derived.
It's possible to convert a derived variable to a non-derived variable. This is known as "integrating" the derived variable and it cannot be undone.
var.integrate()
It's possible to cast a variable from one type to another. It may be useful to preview the result of doing this beforehand, which is what cast_summary is for. For example, to see the summary of a numeric version of a given categorical variable (without actually creating the numeric version):
summary = ds.cast_summary(var_categorical.alias, 'numeric')
If you're happy with the result then you can add a new version of the source variable to the dataset using cast:
var_numeric = ds.cast(var_categorical.alias, 'numeric')
After doing this both the original typed variable and the newly cast version of it will exist. If you no longer need the original you should hide it.
NOTE: If you want to cast a categorical to a numeric you must provide numeric_values for the categories that you want to be cast. Any categories without a numeric_value will be missing in your new numeric.
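For example, a hedged sketch of setting numeric_value on the categories you care about before casting (the alias and ids here are hypothetical):
ds['satisfaction'].categories[1].edit(numeric_value=100)
ds['satisfaction'].categories[2].edit(numeric_value=50)
var_numeric = ds.cast('satisfaction', 'numeric')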
datetime variables have a unique attribute known as "resolution". This attribute may describe one of two things:
- The time span units described explicitly by the case data (datetime is typically stored in milliseconds)
- The rollup time span units used to render the datetime case data in the app (e.g. dates are typically rendered in days, weeks, quarters or years)
The Crunch app displays datetime data by rendering the case data as if it were stored with the variable's rollup_resolution. The default rollup_resolution of a datetime variable is the case data's actual resolution.
Analyzing datetime data in its typically millisecond resolution is often not useful, so the edit_resolution method can be used to apply a rollup_resolution using one of the following codes:
'Y': Year
'Q': Quarter
'M': Month
'W': Week
'D': Day
'h': Hour
'm': Minute
'ms': Millisecond
For example, to render var in the app in weeks:
var.edit_resolution('W')
Dataset variables behave as a Python dictionary, with the Dataset being the structure and variables the elements or keys of the dictionary. Since dictionaries are referenced by key, in this case the key is a variable label and the value is an instance of the class Variable.
- ds.keys() → returns a list of variable names in the dataset
- ds.values() → returns a list of Variables in the dataset
- ds.items() → returns a list of tuples of type (key, value)
Given this interface we can access a Dataset variable by using the Python dictionary syntax:
var = ds['my_var']
Or iterate over a datasets’s variables:
for name, var in ds.items():
print(var.id)
Just like the variables of a dataset, subvariables are equally iterable.
var = ds['my_var']
subvar = var['my_subvariable']
Scrunch also exposes dictionary-like iteration for subvariables.
for subvar in var:
print(subvar.name)
Subvariables support the edit method as Variables do, for example:
subvar.edit(name='new name')
Categorical variables expose their categories as a CategoryList instance, which is a subclass of collections.OrderedDict.
>>> var.categories
CategoryList([
(1, Category(name="Yes", id=1, numeric_value=None, missing=False, selected=False)),
(2, Category(name="No", id=2, numeric_value=None, missing=False, selected=False)),
(-1, Category(name="No Data", id=2, numeric_value=None, missing=True, selected=False))
])
It's possible to edit all attributes of the categories except their id using the edit method.
>>> var.categories[1].edit(name="Sí", numeric_value=1)
>>> var.categories[2].edit(numeric_value=2)
>>> var.categories
CategoryList([
(1, Category(name="Sí", id=1, numeric_value=None, missing=False, selected=False)),
(2, Category(name="No", id=2, numeric_value=2, missing=False, selected=False)),
(-1, Category(name="No Data", id=2, numeric_value=None, missing=True, selected=False))
])
Use the method order, which takes a list of int or str corresponding to either category id or name, to update their order.
>>> var.categories.order(2, 1, -1)
>>> var.categories
CategoryList([
(2, Category(name="No", id=2, numeric_value=2, missing=False, selected=False)),
(1, Category(name="Sí", id=1, numeric_value=None, missing=False, selected=False)),
(-1, Category(name="No Data", id=2, numeric_value=None, missing=True, selected=False))
])
All unmentioned categories will remain at the end of the order, in their current order.
You can also quickly see the id values for each category using list:
>>> list(var.categories)
[2, 1, -1]
You can add new categories to an existing variable:
>>> var.add_category(
... id=5,
... name="No sé",
... numeric_value=None,
... missing=False,
... before_id=-1
... )
>>> var.categories
CategoryList([
(2, Category(name="No", id=2, numeric_value=2, missing=False, selected=False)),
(1, Category(name="Sí", id=1, numeric_value=None, missing=False, selected=False)),
(5, Category(name="No sé", id=5, numeric_value=None, missing=False, selected=False)),
(-1, Category(name="No Data", id=2, numeric_value=None, missing=True, selected=False))
])
You can also delete a category if necessary. It is recommended any data corresponding to such a category be removed beforehand (for example by using replace_values):
var.categories[5].delete()
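For example, a sketch that first blanks out any rows still coded to the category and then deletes it (the alias 'my_var' and category id are hypothetical):
ds['my_var'].replace_values(value=None, filter='my_var == 5')
ds['my_var'].categories[5].delete()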
Variables can be organized into nested folders for convenience. A new Folder class is introduced that contains methods to manipulate variables and subfolders.
The folders API is accessed via:
# Obtains the root folder for variables
root_folder = dataset.folders.root
The elements inside a folder are a mix of Variable and Folder instances; they are available under the .children property.
children = root_folder.children
for child in children:
    if isinstance(child, Variable):
        print("I am variable %s" % child.name)
    elif isinstance(child, Folder):
        print("I am a subfolder %s" % child.name)
To organize the variables in folders, it is necessary to create subfolders using the make_subfolder method in Folder instances.
l1 = root_folder.create_folder("level 1")
l2 = l1.create_folder("level 2")
l3 = l2.create_folder("level 3")
It is possible to create subfolders in specific positions indicating before or after another element name (or variable alias) or by its particular location.
Additionally, it takes the alias keyword, which receives a list of aliases of the variables to be included as part of the newly created folder. They'll be moved out of their current location (see the sketch after the example below).
root = ds.folders.root
root.create_folder("p1")
root.create_folder("p2")
root.create_folder("p3")
root.create_folder("A", before="p2")
assert [c.name for c in root.children] == ["p1", "A", "p2", "p3"]
root.make_subfolder("B", after="p2")
assert [c.name for c in root.children] == ["p1", "A", "p2", "B", "p3"]
root.make_subfolder("C", position=3)
assert [c.name for c in root.children] == ["p1", "A", "C", "p2", "B", "p3"]
Subfolders are always created as the last element of the parent folder. It is possible to get a deeply nested subfolder or variable by using its | separated path.
l3 = root_folder.get("| level 1 | level 2 | level 3")
l3.path # prints "| level 1 | level 2 | level 3"
To put variables or folders inside other folders, the move_here method accepts a list of instances and appends them to the target folder.
folder_a = root_folder.make_subfolder("A")
folder_b = root_folder.make_subfolder("B")
age_var = dataset['Age']
gender_var = dataset['Gender']
folder_a.move_here(age_var, folder_b, gender_var)
folder_b.move_here(age_var)
assert [c.name for c in folder_a.children] == ["B", "Gender"]
assert [c.name for c in folder_b.children] == ["Age"]
It is also possible to move variables by referencing them by alias:
folder_a.move_here("Age", folder_b, "Gender")
Additionally, move_here accepts the before, after and position arguments to place the newly moved items in a particular location in the folder tree.
To reorder the contents of a folder, the reorder method accepts a list of the folder's children and updates the order:
children = folder_a.children
assert [c.name for c in folder_a.children] == ["B", "Gender"]
folder_a.reorder(reversed(children))
assert [c.name for c in folder_a.children] == ["Gender", "B"]
folder_a.reorder(items=children)
assert [c.name for c in folder_a.children] == ["B", "Gender"]
It is also possible to send the name strings of the contents of the folder for reordering.
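For example:
folder_a.reorder(["Gender", "B"])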
To change the name of a folder, use the rename method:
assert folder_a.name == "A"
folder_a.rename("Renamed")
assert folder_a.name == "Renamed"
assert folder_a.parent.get("Renamed").url == folder_a.url
Action | Orders API | Folders API | Details |
---|---|---|---|
Obtain the root element | root = dataset.order['|'] | root = dataset.folders['|'] | |
Fetch a sub folder/group | root['|path | group'] | root['|path | folder'] | |
Make a new group/subfolder | group.create_group() | folder.create_folder() | Both take after, before, position and alias optional arguments |
Change the name of a group | group.rename() | folder.rename() | Both take a string as the new name |
Add to the end of a group | group.append() | folder.append() | Receives either a list of strings (variable aliases) or of Folder and Variable instances that will be added to the end of the target folder. In the folders API this is a wrapper around the .move_here() method |
Add in an arbitrary position | group.insert() | folder.insert() | Another wrapper of Folder.move_here() that takes a list of variable aliases or instances of Folder and Variable, as well as a position argument, and moves the elements to the desired folder |
Move a variable to a group | variable.move(path) | variable.move_to_folder(path) | Both take the same path as the target folder argument and optional before, after and position arguments to indicate the destination position of the single variable. This is also a wrapper for Folder.move_here() |
Move many elements to a group | not available | folder.move_here(elements) | Allows moving several items to a folder in one call. It takes either a list of strings (variable aliases) or Folder and Variable instances. By default they will be appended at the end of the folder. It optionally takes the before, after and position arguments to indicate the location of the new elements |
Reorder the contents of a group | group.reorder() | folder.reorder() | The folders API takes a list of strings (variable aliases/names or folder names that are currently part of the folder) and will reorder the contents accordingly. It also accepts a list of Folder and Variable instances to reorder, the same output as folder.children |
The list of elements of a group | list(group) | folder.children | The .children property returns a list of Variable or Folder instances of the contents |
CrunchBoxes are snapshots of a dataset that can appear in a web page outside of the Crunch App. The data captured in a CrunchBox can be multidimensional in that it can show more than just one piece of data from the dataset (end users can interact with it to change the view they are looking at).
After being created, CrunchBox data becomes static regardless of changes later made to its parent dataset. For example if new rows are appended to the parent dataset or its data is edited in some way those changes will not be expressed in a CrunchBox that was created before those changes were made. In order to take advantage of changes in the parent dataset the CrunchBox would need to be re-created.
Create a new CrunchBox by using the create_crunchbox method on the dataset. create_crunchbox will return an instance of scrunch.datasets.CrunchBox; see below for further details.
>>> cbox = ds.create_crunchbox(
...     title='Short descriptor',
...     header='Header information for the CrunchBox',
...     footer='Footer information for the CrunchBox',
...     notes='Other information relevant for this CrunchBox',
...     variables='List of variable aliases or `Variable` instances (None=all)',
...     filters='List of filter names or Filter instances',
...     min_base_size=None,  # min sample size to display values in graph
...     palette=None,  # dict of colors
...     force=False
... )
The given variables will be available as either primary or secondary variables in the CrunchBox, meaning any of them can be used as either the main variable or as a comparison on the main variable. Since the CrunchBox is an interactive tool its end users will decide on-the-fly how they want to use each variable included in this list.
The given filters must be filters that already exist on the dataset - this is not a list of variable aliases.
Further information on providing a custom palette can be found at http://docs.crunch.io/.
View the CrunchBoxes available on the dataset:
>>> ds.crunchboxes
[<CrunchBox: title='Short descriptor'; id='7e7e275378d8fdb63715c9053409c50e'>]
This list can also be indexed or iterated, and each item is an instance of scrunch.datasets.CrunchBox; see below for further details.
CrunchBox instances have the following attributes and methods:
>>> cbox.widget_url # the widget url
'https://s.crunch.io/widget/index.html#/ds/7e7e275378d8fdb63715c9053409c50e/'
>>> cbox.url # the api url
'https://app.crunch.io/api/datasets/4f2ebd029ff2414191537eff6037099f/boxdata/7e7e275378d8fdb63715c9053409c50e/'
>>> cbox.iframe() # display the full iframe html code
'<figure style="text-align:left;" class="content-list-component image"> <div style="padding-bottom: 12px"> <span style="font-size: 18px; color: #444444; line-height: 1;">Human friendly identifier</span> </div></figure><iframe src="https://s.crunch.io/widget/index.html#/ds/7e7e275378d8fdb63715c9053409c50e/" width="600" height="480" style="border: 1px solid #d3d3d3;"></iframe>'
>>> cbox.remove() # permanently delete the CrunchBox
Exporting datasets is as easy as:
CSV export (the default):
ds.export(path='/Downloads/ds.csv')
SPSS export:
ds.export(path='/Downloads/ds.sav', format='spss')
The export can also be refined in a number of ways. filter provides the ability to export a subset of rows and variables allows a subselection of variables. hidden=True allows you to export variables that are currently hidden (variable attribute discarded=True). timeout allows you to specify the number of seconds that should be allowed for the export to complete. If the export doesn't complete in that time a TaskProgressTimeoutError will be raised. The default timeout is None, which means the export will never time out.
ds.export(
    path='/Downloads/ds.csv',
    filter='var1 == 4',
    variables=['var1', 'var2'],
    hidden=True,
    timeout=60
)
filter may be a logical expression or an actual dataset Filter referenced as per ds.filters["myfilter"].
variables is a list of variable aliases, ids or Variable objects.
When variables is given hidden is ignored.
timeout is either None (the default) or an int representing the number of seconds before timeout.
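For example, exporting with a saved dataset filter rather than an expression (assuming a filter named 'myfilter' was added earlier with add_filter):
ds.export(path='/Downloads/ds.csv', filter=ds.filters['myfilter'])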
Additionally, an options argument in the export method accepts a dict for controlling the same export options seen in the Crunch UI. The default options values for each export format are:
CSV export default options:
options = {
'use_category_ids': True
}
SPSS export default options:
options = {
'prefix_subvariables': False,
'var_label_field': 'description'
}
Example:
ds.export(
path='/Downloads/ds.sav',
format='spss',
options={
'prefix_subvariables': True
}
)