Skip to content
Luis Faria edited this page Feb 15, 2024 · 21 revisions

How it works

Database preservation toolkit converts from a source database format to a destination database format. The format may be a database management system or a preservation format.

To retrieve from a source, the application uses an import module.

To write to a destination, the application uses an export module.

To perform any intermediate actions, the application may use one or more filter modules.

It is the pair composed of an import module and an export module that provides the conversion functionality. There are different modules which can be used and even configured to provide a conversion between database formats.

General usage

The command line application takes a series of arguments, that can be provided in any order. These define the application's behavior.

java [properties] -jar dbptk-app-x.y.z.jar migrate <importModule> [import module options] -e <exportModule> [export module options] [<filterModule(s)> [filter module options]]

NOTE: For Java 9 and greater, you might need to add the following to the Java process:
java --add-opens java.xml/com.sun.org.apache.xerces.internal.jaxp=ALL-UNNAMED -jar ...

How to specify the parameters

The general use command is generic and cannot be used as is. Here are a list of modifications that must be carried out:

  • java is the java command, the full path may also be used
  • [properties] may be omitted or replaced with special configurations that influence the conversion (more details)
  • -jar dbptk-app-x.y.z.jar tells java to execute the dbptk-app-x.y.z.jar file (the file name must be adjusted to match the one you have)
  • <importModule> should be replaced with the import module specification, e.g. -i mysql or --import=postgresql
  • <exportModule> should be replaced with the export module specification, e.g. -e mysql or --export=postgresql
  • <filterModule(s)> should be replaced with a list of filter module specifications separated by ',' with no spaces, e.g. -f external-lobs or --filter=external-lobs,external-lobs
  • [import module options] should be replaced with parameters to specify the behavior of the import module, e.g. --import-username=username --import-password="p4ssw0rd" (to specify source database username and password)
  • [export module options] should be replaced with parameters to specify the behavior of the export module, e.g. --export-file=filename.siard --export-compress --export-pretty-xml (to specify the SIARD-2 export module behavior)
  • [filter module options] should be replaced with parameters to specify the behavior of the filter module(s). As there can be multiple filters declared, these parameters should contain the index of the filter they refer to in the list (even if this list is composed of only one element), e.g. --filter1-dir /home/user/ --filter1-disable-print-header (to specify the inventory filter module behavior)

Short/long parameter format

Parameters have two interchangeable formats, a longer format for readability (e.g. --import-hostname=localhost) and a short format which is faster to type (e.g. -i localhost). Notice that the difference is the shorter/longer parameter name and the number of short dashes used (there is no distinction in using space character or equal sign to separate parameters).

Parameters

Available import modules, for the [import module options] part

Specify the import module with: -i <module>, --import=module

Import module: jdbc

Short option Long option Description Mandatory
-id --import-driver=value The name of the the JDBC driver class. For more info about this refer to the website or the README file. true
-ic --import-connection=value The connection url to use in the connection. true

Note: In order to use this module you need to a JDBC driver. Please refer to this documentation on how to import your on driver.

Import module: microsoft-access

Short option Long option Description Mandatory
-if --import-file=value Path to the Microsoft Access file. true
-ip --import-password=value Password to the Microsoft Access file. false

Import module: microsoft-sql-server

Short option Long option Description Mandatory
-is --import-server-name=value The name (host name) of the server. true
-idb --import-database=value The name of the database we'll be accessing. true
-iu --import-username=value The name of the user to use in the connection. true
-ip --import-password=value The password of the user to use in the connection. true
-il --import-use-integrated-login Use windows login; by default the SQL Server login is used. false
-ide --import-disable-encryption Use to turn off encryption in the connection. false
-iin --import-instance-name=value The name of the instance. false
-ipn --import-port-number=value The port number of the server instance, default is 1433. false

Import module: mysql

Short option Long option Description Mandatory
-ih --import-hostname=value The hostname of the MySQL server. true
-idb --import-database=value The name of the MySQL database. true
-iu --import-username=value The name of the user to use in connection. true
-ip --import-password=value The password of the user to use in connection. true
-ipn --import-port-number=value The port that the MySQL server is listening, default is 3306. false
-ide --import-disable-encryption Use to turn off encryption in the connection. false

Import module: oracle

Short option Long option Description Mandatory
-is --import-server-name=value The name (or IP address) of the Oracle server. true
-ii --import-instance=value The name of the instance to use in the connection. true
-iu --import-username=value The name of the user to use in connection. true
-ip --import-password=value The password of the user to use in connection. true
-ipn --import-port-number=value The port that the Oracle server is listening, default is 1521. true
-ial --import-accept-license Declare that you accept OTN License Agreement, which is necessary to use this module. false

Import module: postgresql

Short option Long option Description Mandatory
-ih --import-hostname=value The name of the PostgreSQL server host (e.g. localhost). true
-idb --import-database=value The name of the database to connect to. true
-iu --import-username=value The name of the user to use in connection. true
-ip --import-password=value The password of the user to use in connection. true
-ide --import-disable-encryption Use to turn off encryption in the connection. false
-ipn --import-port-number=value The port of where the PostgreSQL server is listening, default is 5432. false

Import module: sybase

Short option Long option Description Mandatory
-ih --import-hostname=value The name (host name) of the server. true
-idb --import-database=value The name of the database to use in the connection. true
-iu --import-username=value The name of the user to use in connection. true
-ip --import-password=value The password of the user to use in connection. true
-ide --import-disable-encryption Use to turn off encryption in the connection. false
-ipn --import-port-number=value The port of where the Sybase server is listening, default is 2638. false

Note: In order to use this module you need to use the proprietary driver. Please refer to this documentation on how to import your on driver.

Import module: progress-openedge

Short option Long option Description Mandatory
-ih --import-hostname=value The name (host name) of the server. true
-idb --import-database=value The name of the database to use in the connection. true
-iu --import-username=value The name of the user to use in connection. true
-ip --import-password=value The password of the user to use in connection. true
-ide --import-disable-encryption Use to turn off encryption in the connection. false
-ipn --import-port-number=value The port of where the Sybase server is listening, default is 20931. false

Note: In order to use this module you need to use the proprietary driver. Please refer to this documentation on how to import your on driver.

Import module: siard-1

Short option Long option Description Mandatory
-if --import-file=value Path to SIARD1 archive file. true

Import module: siard-2

Short option Long option Description Mandatory
-if --import-file=value Path to SIARD2 archive file. true

Import module: siard-dk

Short option Long option Description Mandatory
-if --import-folder=value Path to (the first) SIARDDK archive folder. Archive folder name must match the expression AVID.[A-ZÆØÅ]{2,4}.[1-9][0-9]*.1 Any additional parts of the archive (eg. with suffixes .2 .3 etc) referenced in the tableIndex.xml will also be processed. true
-ias --import-as-schema=value Name of the database schema to use when importing the SIARDDK archive. Suggested values: PostgreSQL:'public', MySQL:'', MSSQL:'dbo' true

Import module: import-config

Short option Long option Description Mandatory
-if --import-file=value Path to the import configuration file to be read by the SIARD export module. false
-ip --import-parameters=value Pair of parameters to be resolved in the YAML configuration file. To define a pair use this syntax: key:value;key:value; false

Available export modules, for the [export module options] part

Specify the export module with: -e <module>, --export=module

Export module: jdbc

Short option Long option Description Mandatory
-ed --export-driver=value the name of the the JDBC driver class. For more info about this refer to the website or the README file. true
-ec --export-connection=value the connection url to use in the connection. true

Export module: microsoft-sql-server

Short option Long option Description Mandatory
-es --export-server-name=value The name (host name) of the server. true
-edb --export-database=value The name of the database we'll be accessing. true
-eu --export-username=value The name of the user to use in the connection. true
-ep --export-password=value The password of the user to use in the connection. true
-el --export-use-integrated-login Use windows login; by default the SQL Server login is used. false
-ede --export-disable-encryption Use to turn off encryption in the connection. false
-ein --export-instance-name=value The name of the instance. false
-epn --export-port-number=value The port number of the server instance, default is 1433. false

Export module: mysql

Short option Long option Description Mandatory
-eh --export-hostname=value The hostname of the MySQL server. true
-edb --export-database=value The name of the MySQL database. true
-eu --export-username=value The name of the user to use in connection. true
-ep --export-password=value The password of the user to use in connection. true
-epn --export-port-number=value The port that the MySQL server is listening, default is 3306. false
-ede --export-disable-encryption Use to turn off encryption in the connection. false

Export module: oracle

Short option Long option Description Mandatory
-es --export-server-name=value The name (or IP address) of the Oracle server. true
-ei --export-instance=value The name of the instance to use in the connection. true
-eu --export-username=value The name of the user to use in connection. true
-ep --export-password=value The password of the user to use in connection. true
-epn --export-port-number=value The port that the Oracle server is listening. true
-eal --export-accept-license Declare that you accept OTN License Agreement, which is necessary to use this module. false
-esc --export-source-schema=value The name of the source schema to export to the Oracle database. A schema with this name must exist in the Oracle database and it must be the default tablespace for the specified user. If omitted, the name of the first schema will be used. false

Export module: postgresql

Short option Long option Description Mandatory
-eh --export-hostname=value The name of the PostgreSQL server host (e.g. localhost). true
-edb --export-database=value The name of the database to connect to. true
-eu --export-username=value The name of the user to use in connection. true
-ep --export-password=value The password of the user to use in connection. true
-ede --export-disable-encryption Use to turn off encryption in the connection. false
-epn --export-port-number=value The port of where the PostgreSQL server is listening, default is 5432. false

Export module: siard-1

Short option Long option Description Mandatory
-ef --export-file=value Path to SIARD1 archive file true
-ec --export-compress Use to compress the SIARD1 archive file with deflate method false
-ep --export-pretty-xml Write human-readable XML false
-emd --export-meta-description=value SIARD descriptive metadata field: Description of database meaning and content as a whole. false
-ema --export-meta-archiver=value SIARD descriptive metadata field: Name of the person who carried out the archiving of the database. false
-emac --export-meta-archiver-contact=value SIARD descriptive metadata field: Contact details (telephone, email) of the person who carried out the archiving of the database. false
-emdo --export-meta-data-owner=value SIARD descriptive metadata field: Owner of the data in the database. The person or institution that, at the time of archiving, has the right to grant usage rights for the data and is responsible for compliance with legal obligations such as data protection guidelines false
-emdot --export-meta-data-origin-timespan=value SIARD descriptive metadata field: Origination period of the data in the database (approximate indication in text form). false
-emcm --export-meta-client-machine=value SIARD descriptive metadata field: DNS name of the (client) computer on which the archiving was carried out. false

Export module: siard-2

Short option Long option Description Mandatory
-ef --export-file=value Path to SIARD2 archive file. true
-ec --export-compress Use to compress the SIARD2 archive file with deflate method. false
-ep --export-pretty-xml Write human-readable XML. false
-eel --export-external-lobs Saves any LOBs outside the SIARD file. false
-eelpf --export-external-lobs-per-folder=value The maximum number of files present in an external LOB folder. Default: 1000 files. false
-eelfs --export-external-lobs-folder-size=value Divide LOBs across multiple external folders with (approximately) the specified maximum size (in Megabytes). Default: do not divide. false
-eelblobtl --export-external-lobs-blob-threshold-limit=value Keep BLOBs stored inside the SIARD file if the threshold is not exceeded (in bytes). Default: 2000 bytes false
-eelclobtl --export-external-lobs-clob-threshold-limit=value Keep CLOBs stored inside the SIARD file if the threshold is not exceeded (in bytes). Default: 4000 bytes false
-emd --export-meta-description=value SIARD descriptive metadata field: Description of database meaning and content as a whole. false
-ema --export-meta-archiver=value SIARD descriptive metadata field: Name of the person who carried out the archiving of the database. false
-emdo --export-meta-archiver-contact=value SIARD descriptive metadata field: Owner of the data in the database. The person or institution that, at the time of archiving, has the right to grant usage rights for the data and is responsible for compliance with legal obligations such as data protection guidelines. false
-emdot --export-meta-data-owner=value SIARD descriptive metadata field: Origination period of the data in the database (approximate indication in text form). false
-emcm --export-meta-client-machine=value SIARD descriptive metadata field: DNS name of the (client) computer on which the archiving was carried out. false
-egml --export-gml-directory=value Directory in which to create .gml files from tables with geometry data. false
-ed --export-digest The message digest algorithm for the type of integrity information (Supported MessageDigest algorithms here). Default: SHA-256. false
-efc --export-font-case Define the type of font case for the message digest. Supported font case are: upper case and lower case. Default: lowercase false

Export module: siard-dk

Short option Long option Description Mandatory
-ef --export-folder=value Path to SIARDDK archive folder. Archive folder name must match the expression AVID.[A-ZÆØÅ]{2,4}.[1-9][0-9]*.[1-9][0-9]. true
-eai --export-archiveIndex=value Path to archiveIndex.xml input file. false
-eci --export-contextDocumentationIndex=value Path to contextDocumentationIndex.xml input file. false
-ecf --export-contextDocumentationFolder=value Path to contextDocumentation folder which should contain the context documentation for the archive. false

Export module: import-config

Short option Long option Description Mandatory
-ef --export-file=value Path to the import configuration file true

Available filter modules, for the [filter module options] part

Specify the filter module(s) with: -f <module(s)>, --filter=module(s)

Filter module: external-lobs

Filter module: merkle-tree

Filter module: inventory

For the [properties] part

Several properties are available to modify specific conversion behaviour. You can consider them as knobs that can be turned to fine-tune the conversion.

The properties have a format like part1.part2.part3, with multiple lower-case parts separated by dots. All properties have a corresponding environment variable, like PART1_PART2_PART3 (corresponding to the previous example), with the same parts in upper-case and separated by underscores.

Properties are added to the command line like this:


... -Dpart1.part2.part3=value -Danother.property=othervalue ...

Note: in windows, each property and value pair must be enclosed in ", example ... "-Dpart1.part2.part3=value" ...

If both the environment variable and the property are set, the property is used.

For simplicity, only the properties will be described, and the environment variables can be derived from those by using uppercased letters and replacing the dots with underscores (as described above).

Available properties

Fetch size

Controls the amount of rows that are retrieved from the database and stored in memory at once.

  • dbptk.jdbc.fetchsize.default (Integer) - the first fetch size to try (default: 0, which means "use the default value suggested/calculated by the driver")
  • dbptk.jdbc.fetchsize.small (Integer) - the second fetch size to try, in case the first one caused an issue (default: 10)
  • dbptk.jdbc.fetchsize.minimum (Integer) - the last fetch size to try, in case the second one also caused an issue. This is the last try before giving up on fetching information from this table (default: 1)

Setting dbptk.jdbc.fetchsize.default to 1 fetches one row at a time, using minimal memory during the conversion but taking longer to convert the database.

For more details check https://github.com/keeps/db-preservation-toolkit/pull/292

Oracle

Controls the amount of LOB that is prefetch for each row retrieved from the database and stored in memory at once.

  • dbptk.jdbc.oracle.lobPrefetchSize (Integer) - This property allows to configure how much of the LOB data is fetched the first time is requested. (default: 4000 bytes)

For more details check https://github.com/keeps/db-preservation-toolkit/issues/437

SSH port range

Controls the open port search range by defining the minimum and maximum value to search for.

  • dbptk.ssh.port.findmin (Integer) - the minimum value to included (default: 1024)
  • dbptk.ssh.port.findmax (Integer) - the maximum value to included (default: 49151)

MapDB options

Controls the location of the directory where to save the off-heap file (depending on the size of the SIARD file this off-heap file can grow substantially)

  • dbptk.memory.dir (String) - the directory path for the off-heap file storage (default: an hidden folder named dbptk under your $HOME directory)

Timezone options

Controls the timestamp field handling from the Java. Thanks to @ateras

  • user.timezone=GMT - tells Java not to do any unexpected conversions when handling the timestamp fields