LISTER: Life Science Experiment Metadata Parser

This repository contains a set of files to parse documentation of experiments in eLabFTW.

A more detailed explanation regarding LISTER is provided in the following paper: https://pubs.acs.org/doi/10.1021/acs.jcim.3c00744

Additional resources are available to try out LISTER on your own:

Overview

LISTER is a tool that simplifies metadata extraction for your eLabFTW experiments, improving the findability and reproducibility of your work. By using LISTER, you can save time and effort in documenting your research and annotating your research data while ensuring that your metadata is well-organized and easily accessible. LISTER outputs your metadata in both Excel and JSON formats, making it easy for humans to read and machines to process, as illustrated in Table 1.

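| Par. No. | Key | Value | Measure | Unit |
| --- | --- | --- | --- | --- |
| - | section level 0 | Remarks | | |
| - | section level 0 | Precultures | | |
| 4 | Date of experiment | 29.09.2017 | | |
| 5 | expression strain | P. putida KT2440 pVLT33::pigC | | |
| 5 | negative control | empty vector strain | | |
| 5 | inoculum | single colony | | |
| 5 | growth media | LB Kan | 5 | mL |
| 5 | temperature | 30 | | °C |
| 5 | shaking | 250 | | rpm |
| 5 | time | overnight | | |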

Table 1. An example of the extracted metadata from LISTER in tabular Excel format.

LISTER's semi-automated metadata extraction is powered by a user-friendly annotation mechanism that empowers experiment documentation writers to effortlessly highlight metadata pairs they consider crucial. The appeal of this annotation system lies in its simplicity and readability for both humans and machines. It works as illustrated in Figure 1.

Figure 1. Lightweight annotated experiment document snippets are extracted for (a) metadata in JSON and Excel format and (b) clean Word documentation

Once the metadata is extracted, LISTER creates a clean Word document along with the metadata extraction. When exporting to a Word document, LISTER removes the annotations and simplifies the content, presenting you with a clean and simplified document that is ready for sharing or publication, ensuring that others can reproduce your experiments with ease, as illustrated in Figure 2.

Figure 2. An example of a Word experiment documentation generated by LISTER.

LISTER's annotation process can be streamlined by integrating with your eLabFTW implementation's protocol/materials and methods catalog. Our collection of materials and methods is illustrated in Figure 3. These protocols are pre-annotated using LISTER's annotation scheme (see below for details). When you import these pre-annotated catalog entries into your experiment entries (which can be done with a click in eLabFTW), you ensure consistency throughout your research documentation without having to write and annotate the documentation from scratch. The adaptation of the actual value being used, however, has to be done manually.

Figure 3. A collection of pre-annotated reusable catalog entries that can be imported to relevant experiments.

If your lab's eLabFTW implementation uses specific groupings (we call these groupings containers), such as by publication, project, study, or system, LISTER can extract metadata for all experiments within a group with a single click. LISTER can also include additional information related to your groupings (such as a publication), adding context to your extracted metadata. In the following example, LISTER provides the publication's title, authors, journal, status, and DOI for the experiments belonging to a publication, along with the related project, study, and system information. This is illustrated in Figure 4.

Figure 4. A publication consists of several linked experiments, all of which can be extracted for their metadata in a single click.

By embracing a structured approach to your eLabFTW organization, your experiments will remain well-organized and easily accessible. LISTER's one-click metadata parsing for experiments belonging to, for example, a study is an illustration of LISTER's metadata extraction efficiency. With a single click, LISTER generates a comprehensive folder structure where each experiment directory contains a Word document detailing the experiment without bracket annotations, an Excel file with a table of metadata entries, a JSON metadata file, and any attachments you have provided in your corresponding eLabFTW experiment entries. This structured output ensures that your experiment documentation is clear, concise, and ready for research data publication, sharing, and archival. This is illustrated in Figure 5.

Figure 5. A study container, once extracted, will have its experiments' metadata, documentation, and attachments extracted into their respective subfolders.

The resulting metadata, once uploaded into a platform based on the Data Portal CMS DSpace 7, will be indexed for searchability. DSpace 7 also supports logical search operators for searching the key-value pairs of your metadata.

Screenshots

User interface

Figure 6. User interface for parsing an experiment.

Figure 7. User interface for parsing a container (e.g., Publications, Project, etc.).

eLabFTW annotations

Figure 8. How the annotation is done to enable parsing via LISTER. This annotated fragment can be derived from reusable, lab-curated experiment protocols/materials and methods. See the Annotation mechanism section below.

Figure 9. Linked items section of an experiment, in which the tabular content will be parsed to gather more context, e.g., the related Study, Project, and System.

Outputs


Figure 10. Metadata output in the Excel sheet, after parsing with LISTER.

Installing and running LISTER

LISTER is distributed as an executable file for Windows, Linux, or macOS (with an Intel chipset). The executable file for each platform is available on the release page, along with another, platform-specific file.

  • For Windows and Linux, place the executable file (lister.exe on Windows or lister on Linux) within the same folder as the config.json file.

  • For macOS, create the directory ~/Apps/lister first and place the executable lister.app and config.json in this directory.

Adapting the config.json file

Parsing an eLabFTW entry requires

  • general parameters

    • eLabFTW API token and API endpoint, which can be obtained from the administrator of your lab's or university's eLabFTW instance,

    • Default output directory, i.e., a directory path used to store the parsing outputs.

  • specific parameter

    • Experiment ID or Database ID for the entry to be parsed.
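
The parameters above could be checked before parsing with a short script. This is a minimal sketch only: the key names (api_token, api_endpoint, output_directory), the load_config helper, and the example endpoint are hypothetical placeholders, not the actual schema of LISTER's config.json.

```python
import json

# Hypothetical key names for illustration; the real config.json may differ.
REQUIRED_KEYS = ("api_token", "api_endpoint", "output_directory")


def load_config(text):
    """Parse config JSON text and report any missing general parameters."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("missing config entries: " + ", ".join(missing))
    return config


# Placeholder values; replace them with those from your own eLabFTW instance.
example = json.dumps({
    "api_token": "YOUR_TOKEN",
    "api_endpoint": "https://elab.example.org/api/v1/",
    "output_directory": "output",
})
config = load_config(example)
print(config["output_directory"])
```

Failing early on a missing token or endpoint gives a clearer message than an HTTP error later in the run.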

Annotation mechanism

The annotation mechanism allows extracting metadata from experiment documentation as .xlsx and .json files. The following points describe the basic elements for annotating protocols/materials and methods (MM) so that they can be parsed by LISTER.

  • Key-Value (KV) elements.

    • A KV pair is written as {value|key} in an experiment entry.

    • If applicable, a KV pair is extendable with measure and unit. Therefore, there are two more variations for writing a KV pair:

      • {measure|unit|key}: the measure and unit will be mapped into value and unit.
      • {measure|unit|value|key}: the measure and unit will be taken as given.
    • For example, “Two {100|mL|LB Kan|expression media} cultures in {unbaffled Erlenmeyer|:flasks:}” consists of two pair patterns:

      • {measure|unit|value|key} -> {100|mL|LB Kan|expression media}
      • {value|key} -> {unbaffled Erlenmeyer|:flasks:}
    • Keys are hidden by default in the .docx output file to avoid superfluous text.

    • To make a key visible, enclose it in colons as {value|:key:}, such as {unbaffled Erlenmeyer|:flasks:}.

  • Order.

    • As there can be identical keys within an experiment entry, disambiguation is needed.
    • The disambiguation is done through the paragraph number, which will be extracted and associated with each KV pair.
  • Comments. There are three types of comments supported in LISTER.

    • Comments parsed as-is.
      • This retains both brackets and content in the Word document.
      • Annotation is done using regular brackets ().
      • Annotation example: (This comment will be parsed as is, retaining both the content and the brackets in the .docx file.).
    • Invisible comments.
      • Used to specify additional notes (regarding, e.g., parameter use) that should be hidden from the final experiment documentation output.
      • Annotation is done using a pair of underscores inside regular brackets (_ _).
      • Annotation example: (_This comment will be invisible in the .docx output file._).
    • Comments retained without brackets.
      • This is typically used for comments within KV pairs.
      • Annotation is done using a pair of colons inside regular brackets (: :).
      • Annotation example: (:This comment's bracket will be invisible in the .docx output file, but the text content will be kept.:).
  • Conditionals and iterations handling.

    • LISTER supports documenting conditionals and iterations, but this should be used cautiously: As the final experiment documentation is unlikely to have these conditional and iteration clauses, researchers are required to resolve them by adapting the experiment parameter values with the actual values used during the experiment.
  • Reference management.

    • References can be provided if the referred source has a DOI.
    • Annotation is done using regular brackets and providing the DOI (not URI) in the bracket.
    • The DOI will be converted into Arabic numerals in square brackets, which refer to the reference provided at the bottom of the document.
    • References are only retained in the .docx output, but not in the metadata outputs (.xlsx/.json).
    • Example: (DOI_CODE), such as (10.1073/pnas.062492699), will be written as [1] in the experiment body, and as a numbered list of DOI codes at the end of the experiment documentation.
  • Sections.

    • The keywords section and subsection mark the separation between sections and subsections.
    • This is done by using the <section|section name> annotation.
    • Deeper nesting levels are also supported, e.g., <subsubsection|section name>, which will output different sectioning levels in the .xlsx and .json files and different heading levels in the .docx file.
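
The key-value patterns described above can be illustrated with a small regular-expression sketch. This is not LISTER's actual parser: the names KV_PATTERN, clean, and extract_pairs are made up for this example, and it assumes that annotations are not nested and that bracketed comments are dropped from keys and values.

```python
import re

# Matches one {...} annotation that contains no nested curly brackets.
KV_PATTERN = re.compile(r"\{([^{}]*)\}")
# Matches (...) comments so they can be dropped from keys and values.
COMMENT_PATTERN = re.compile(r"\([^()]*\)")


def clean(field):
    """Drop bracketed comments and the colons that mark a visible key."""
    return COMMENT_PATTERN.sub("", field).strip().strip(":").strip()


def extract_pairs(paragraph, order):
    """Return (order, key, value, measure, unit) rows for one paragraph."""
    rows = []
    for match in KV_PATTERN.finditer(paragraph):
        fields = [clean(f) for f in match.group(1).split("|")]
        if len(fields) == 2:      # {value|key}
            value, key = fields
            measure, unit = "-", "-"
        elif len(fields) == 3:    # {measure|unit|key}: measure becomes the value
            value, unit, key = fields
            measure = "-"
        elif len(fields) == 4:    # {measure|unit|value|key}
            measure, unit, value, key = fields
        else:
            continue              # not a recognized KV pattern
        rows.append((order, key, value, measure, unit))
    return rows


text = "Two {100|mL|LB Kan|expression media} cultures in {unbaffled Erlenmeyer|:flasks:}"
for row in extract_pairs(text, 5):
    print(row)
```

The order argument stands in for the paragraph number, which is what disambiguates identical keys within one experiment entry.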

Examples of annotations vs. extracted metadata

Each entry below lists the extracted item, its description, its representation, an example, and the extracted metadata rows in the form order, key, value, and optionally measure and unit.

  • Section: the section name.
    • Representation: <section|section name>
    • Example: <section|Structure Preparation>
    • Extracted: "-", section level 0, Structure Preparation, -, -
  • Order: the order of the steps, based on the order of the paragraphs in the experiment documentation.
    • Representation: -
    • Example: -
    • Extracted: -
  • Key: the key for the metadata, connected to the value.
    • Representation: {value|key}
    • Example: {sequence alignment|stage}
    • Extracted: <order>, stage, sequence alignment, -, -
  • Comment (see also the bullet points about comments above for variations): comments are allowed within the key-value annotation, represented within regular brackets. Comments can be placed before and/or after the key and/or value.
    • Representation: {value|(comment) key} or {value (comment)|key} or {value (comment)|(comment) key}
    • Example: {receptor residue|(minimization) target}
    • Extracted: <order>, target, receptor residue, -, -
  • Value: the value of the metadata is the first item within the curly brackets.
    • Representation: {value|key}
    • Example: {sequence alignment|stage}
    • Extracted: <order>, stage, sequence alignment, -, -
  • Measure and Unit: the measure and unit of the corresponding key/value pair.
    • Representation: {measure|unit|value|key}
    • Example: {100|mL|LB Kan|expression media}
    • Extracted: <order>, expression media, LB Kan, 100, mL
  • Value and Unit: in some cases, a value is attached to a unit directly, without having to provide a measure.
    • Representation: {value|unit|key}
    • Example: {250|rpm|shaking}
    • Extracted: <order>, shaking, 250, -, rpm
  • Control flow: for each. Extracts multiple key-value pairs related to a for each iteration.
    • Representation: <for each|key>
    • Example: <for each|generated pose>
    • Extracted:
      • <order>, step type, iteration, -, -
      • <order>, flow type, for each, -, -
      • <order>, flow parameter, generated pose, -, -
  • Control flow: for. Extracts multiple key-value pairs related to a for iteration.
    • Representation: <for|key|[range]|iteration operation|magnitude>
    • Example: <for|pH|[1-7]|+|1>
    • Extracted:
      • <order>, step type, iteration, -, -
      • <order>, flow type, for, -, -
      • <order>, flow parameter, pH, -, -
      • <order>, flow range, [1-7], -, -
      • <order>, start iteration value, 1, -, -
      • <order>, end iteration value, 7, -, -
      • <order>, flow operation, +, -, -
      • <order>, flow magnitude, 1, -, -
  • Control flow: while. Extracts multiple key-value pairs related to a while iteration.
    • Representation: <while|key|logical operator|value> ... <iterate|iteration operation|magnitude>
    • Example: <while|pH|lte|7> ... <iterate|+|1>
    • Extracted:
      • <order>, step type, iteration, -, -
      • <order>, flow type, while, -, -
      • <order>, flow parameter, pH, -, -
      • <order>, flow logical parameter, lte, -, -
      • <order>, flow compared value, 7, -, -
      • <order>, flow type, iterate, -, - (after while)
      • <order>, flow operation, +, -, -
      • <order>, flow magnitude, 1, -, -
  • Control flow: if. Extracts multiple key-value pairs related to an if conditional.
    • Representation: <if|key|logical operator|value>
    • Example: <if|pH|lte|7>
    • Extracted:
      • <order>, step type, conditional, -, -
      • <order>, flow type, if, -, -
      • <order>, flow parameter, pH, -, -
      • <order>, flow logical parameter, lte, -, -
      • <order>, flow compared value, 7, -, -
  • Control flow: else if. Extracts multiple key-value pairs related to an else if conditional.
    • Representation: <else if|key|logical operator|value>
    • Example: <else if|pH|between|[8-12]>
    • Extracted:
      • <order>, step type, conditional, -, -
      • <order>, flow type, else if, -, -
      • <order>, flow parameter, pH, -, -
      • <order>, flow logical parameter, between, -, -
      • <order>, flow range, [8-12], -, -
      • <order>, start iteration value, 8, -, -
      • <order>, end iteration value, 12, -, -
  • Control flow: else. Extracts multiple key-value pairs related to an else conditional.
    • Representation: <else>
    • Example: <else>
    • Extracted:
      • <order>, step type, conditional, -, -
      • <order>, flow type, else, -, -
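
As an illustration of how a for annotation maps to the metadata rows listed above, the following sketch expands <for|key|[range]|iteration operation|magnitude> into rows. The parse_for function and RANGE_PATTERN are hypothetical names, not LISTER's own implementation, and the sketch assumes integer range bounds.

```python
import re

# Assumption: ranges are written with integer bounds, e.g. [1-7].
RANGE_PATTERN = re.compile(r"\[(-?\d+)-(-?\d+)\]")


def parse_for(annotation, order):
    """Turn <for|key|[start-end]|operation|magnitude> into metadata rows."""
    fields = [f.strip() for f in annotation.strip("<>").split("|")]
    if len(fields) != 5 or fields[0] != "for":
        raise ValueError("expected <for|key|[range]|operation|magnitude>")
    _, key, value_range, operation, magnitude = fields
    match = RANGE_PATTERN.fullmatch(value_range)
    if match is None:
        raise ValueError("range must look like [1-7]")
    start, end = match.groups()
    pairs = [
        ("step type", "iteration"),
        ("flow type", "for"),
        ("flow parameter", key),
        ("flow range", value_range),
        ("start iteration value", start),
        ("end iteration value", end),
        ("flow operation", operation),
        ("flow magnitude", magnitude),
    ]
    # Control-flow rows carry no measure or unit, hence the "-" placeholders.
    return [(order, k, v, "-", "-") for k, v in pairs]


for row in parse_for("<for|pH|[1-7]|+|1>", 3):
    print(row)
```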

Supported operators

Logical operator

A logical operator is used to decide whether a particular condition is met in an iteration/conditional block. It is available for the while, if, and else if control flows. The following logical operators are supported:

  • e: equal

  • ne: not equal

  • lt: less than

  • lte: less than or equal to

  • gt: greater than

  • gte: greater than or equal to

  • between: between (within a given range, e.g., [8-12])
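
LISTER records these operators as metadata. To make their intended meaning concrete, the sketch below maps each operator name to a Python comparison. The names LOGICAL_OPERATORS and condition_holds are hypothetical, and treating between as an inclusive range check is an assumption rather than documented behavior.

```python
import operator

LOGICAL_OPERATORS = {
    "e": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
    # Assumption: "between" is treated as an inclusive range check here.
    "between": lambda value, bounds: bounds[0] <= value <= bounds[1],
}


def condition_holds(value, name, compared):
    """Evaluate one documented condition, e.g. <if|pH|lte|7> with pH = value."""
    return LOGICAL_OPERATORS[name](value, compared)


print(condition_holds(6.5, "lte", 7))            # <if|pH|lte|7> with pH 6.5
print(condition_holds(10, "between", (8, 12)))   # <else if|pH|between|[8-12]> with pH 10
```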

Iteration operator

An iteration operator is used to change the value of a variable in a loop. It is available for while and for. The following iteration operators are supported:

  • +: iteration using addition

  • -: iteration using subtraction

  • %: iteration using modulo

  • *: iteration using multiplication

  • /: iteration using division
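
The effect of an iteration operator can be sketched by listing the values a documented loop would visit. This is an illustration only: expand_for and ITERATION_OPERATORS are hypothetical names, the range bounds are assumed to be inclusive, and max_steps is a safety limit added for this example.

```python
import operator

ITERATION_OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "%": operator.mod,
    "*": operator.mul,
    "/": operator.truediv,
}


def expand_for(start, end, symbol, magnitude, max_steps=1000):
    """List the values visited by <for|key|[start-end]|symbol|magnitude>."""
    step = ITERATION_OPERATORS[symbol]
    low, high = min(start, end), max(start, end)
    values = []
    current = start
    # Assumption: both range bounds are inclusive.
    while low <= current <= high and len(values) < max_steps:
        values.append(current)
        following = step(current, magnitude)
        if following == current:  # e.g. "* 1" would never advance
            break
        current = following
    return values


print(expand_for(1, 7, "+", 1))    # values for <for|pH|[1-7]|+|1>
print(expand_for(1, 16, "*", 2))
```

Listing the visited values this way is also a convenient check when resolving a loop into the actual parameter values used during the experiment.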

Parsable entries comparison

| Instance | Example | Can the table be parsed as metadata? | Can annotated text be parsed? | Should the metadata specified before (be it table or annotated text) be parsed? |
| --- | --- | --- | --- | --- |
| Experiment entries | - | No, we still need to define how heterogeneous tables can be extracted into key-value pairs | Yes | Yes |
| Management-instance database entries | Project, System, Study | Yes, as long as it has a two-column structure | No, it is deemed unnecessary as KV pairs are already in tabular form | Yes |
| SOP-instance database entries | MM, Method, Methods, Protocol, Protocols | No | Yes | No, it should already have been inserted into the experiment instead, with its parameters adapted |
| Container-instance database entries | Publication | Yes | No, it is deemed unnecessary as KV pairs are already in tabular form | Yes |

Document validation

LISTER checks and reports the following syntax issues upon parsing:

  • Orphaned brackets.

  • Mismatched data types for conditionals and iterations.

  • Mismatched argument numbers for conditionals and iterations.

  • Invalid control flows.
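
The first of these checks, orphaned brackets, can be sketched with a simple stack. This is not LISTER's own validator: find_orphaned_brackets is a hypothetical helper that only considers the bracket types used by the annotation scheme, and it would also flag a literal "<" or ">" used as a comparison sign in running text.

```python
# Closing bracket -> matching opening bracket, for the annotation brackets.
PAIRS = {")": "(", "}": "{", ">": "<"}
OPENERS = set(PAIRS.values())


def find_orphaned_brackets(text):
    """Return (position, character) for each bracket without a partner."""
    stack = []    # open brackets still waiting for their close
    orphans = []
    for position, char in enumerate(text):
        if char in OPENERS:
            stack.append((position, char))
        elif char in PAIRS:
            if stack and stack[-1][1] == PAIRS[char]:
                stack.pop()
            else:
                orphans.append((position, char))
    orphans.extend(stack)  # anything left open is orphaned too
    return sorted(orphans)


print(find_orphaned_brackets("{100|mL|LB Kan|expression media"))
print(find_orphaned_brackets("{250|rpm|shaking}"))
```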

Image extraction

Images are extracted from the experiment documentation, but there is no metadata or naming scheme for the extracted images.

Recommendations

  • Avoid referring to, e.g., a section without explicitly using a key-value pair (avoid, e.g., "Repeat step 1 with similar parameters"), as this will make the metadata extraction for that particular implicit step impossible.

  • To minimize confusion regarding units of measurement (e.g., fs vs. ps), explicitly state the unit in the key-value annotation, e.g., {0.01|ps|gamma_ln}.

GitHub repository structure

  • The base directory contains the metadata extraction script.

  • The output directory contains the extracted metadata: step order – key – value – measure – unit in JSON and XLSX format.

Miscellaneous

Packaging LISTER

  • Packaging is done through the PyInstaller library and has to be done on the respective platform. PyInstaller should be installed first.

  • A .spec file to build LISTER can be generated using the pyi-makespec command, e.g., pyi-makespec --onedir lister.py to create a spec file to package the LISTER app as one directory instead of one file.

  • The spec file for each platform is provided in the root folder of the LISTER GitHub repository.

  • The resulting packaged app will be available under the dist directory, which is created automatically during the build process.

  • It is recommended to use a virtual environment, created with either Python's venv or Anaconda.

    • Using venv

      • Create a virtual environment named venv inside the lister directory, using Python 3.9 as the interpreter: python3.9 -m venv venv

      • Set your IDE to use this environment as its Python interpreter. In PyCharm, go to Settings - Project: lister - Python Interpreter - Add Interpreter and select /lister/venv/bin/python

      • Activate the environment: source venv/bin/activate

      • Install the required libraries: pip install xlsxwriter gooey python-docx elabapy beautifulsoup4 pyinstaller pandas latex2mathml

      • Build using the build scripts described below

Packaging the app on Windows

  • One directory version - from the root folder of the repo, run pyinstaller .\build-scripts\build-windows-onedir.spec

  • One file version - from the root folder of the repo, run pyinstaller .\build-scripts\build-windows-onefile.spec

Packaging the app on Linux

  • One file version - from the root folder of the repo, run pyinstaller build-scripts/build-linux-onefile.spec

Packaging the app on macOS

  • One file version - from the root folder of the repo, run pyinstaller build-scripts/build-macos-onefile.spec

Troubleshooting

Generic platform: slow initial execution

This is likely caused by the single-executable LISTER app being decompressed into a temporary directory at startup. The multi-file distribution (the one-directory version) can be used instead, although it is not as tidy as the single-executable LISTER app.

Generic platform: contrast problem with GUI text

LISTER only supports the OS's default light theme; custom or dark themes are not supported.

Windows: encoding error

When the following error 'charmap' codec can't encode characters in position... appears, open cmd.exe as an administrator before running LISTER and type the following:

setx /m PYTHONUTF8 1

setx PATHEXT "%PATHEXT%;.PY"
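
To see why this error occurs, the following sketch checks whether a character can be encoded with a given codec and whether Python's UTF-8 mode is active. It assumes cp1252 as the legacy Windows code page (yours may differ), and can_encode is a hypothetical helper, not part of LISTER.

```python
import sys


def can_encode(text, encoding):
    """Report whether text can be written with the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# UTF-8 mode is what the PYTHONUTF8 environment variable switches on.
print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))
print("cp1252 can encode 'Δ':", can_encode("Δ", "cp1252"))
print("utf-8 can encode 'Δ':", can_encode("Δ", "utf-8"))
```

Characters such as Greek letters in experiment documentation are a typical trigger, because a legacy code page cannot represent them.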

Windows: packaging failure

The error win32ctypes.pywin32.pywintypes.error: (110, 'EndUpdateResourceW', 'The system cannot open the device or file specified.') is caused by file access problems on Windows. Ensure that the directory is neither read-only nor auto-synced to cloud storage, exclude the repo folder from antivirus scanning, and/or try removing both the build and dist directories, which are generated automatically upon packaging.

macOS: dependencies not included

Consider using an environment management system such as Anaconda to package the app. Install conda locally along with the dependencies stated in requirements.txt. Python 3.9.15 was used for the release. LISTER runs on macOS v13.0.1 and macOS v10.12.4 on Intel machines.

macOS: unable to get into GUI

Running lister.py directly from your IDE on macOS may lead to the following message:

This program needs access to the screen. Please run with a Framework build of python, and only when you are logged in on the main display of your Mac.

Run the script from the terminal using pythonw lister.py instead.

macOS 14: Secure coding is not enabled...

The following warning appears when running lister.py directly in the terminal:

Python[67201:349757] WARNING: Secure coding is not enabled for restorable state! Enable secure coding by implementing NSApplicationDelegate.applicationSupportsSecureRestorableState: and returning YES.

This warning can be ignored and does not affect the functionality of LISTER.

"Client Error: Forbidden for url: ..."

The error requests.exceptions.HTTPError: 403 Client Error: Forbidden for url:... occurs when the specified API token/key does not have access rights to an entry (or its underlying entries). Check that the user associated with the specified token has access to the entries directly linked to the experiments/database items/containers.

"requests.exceptions.ConnectionError: ..."

If the error requests.exceptions.ConnectionError: HTTPConnectionPool(host='..., port=80): Max retries exceeded with url: ... (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at ...>: Failed to establish a new connection: [Errno 61] Connection refused')) occurs, use https instead of http in the API endpoint.
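
A quick way to apply this fix is to rewrite the scheme of the endpoint before saving it. The sketch below uses only the Python standard library; force_https is a hypothetical helper and the example URL is a placeholder, not a real eLabFTW instance.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(endpoint):
    """Return the endpoint with its scheme switched from http to https."""
    parts = urlsplit(endpoint)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


print(force_https("http://elab.example.org/api/v1/"))
```

An explicit port in the URL (such as :80) is left untouched by this sketch and would need to be removed or changed by hand.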