Thomas Vincent edited this page Jan 5, 2020 · 13 revisions

Metadata file indexer and search engine

File organization can be more flexible and file browsing less tedious.

Requirements

Here are some files with associated user-defined metadata:

/home/work/projects/ginzoo2000/doc/letter_iron_cie.doc
    -> author=me, doc_type=letter, recipient=iron_cie, date=2017-01-01

/home/work/projects/ginzoo2000/data/survey.xls
    -> author=[com_team, me], doc_type=spreadsheet, datatype=poll, recipient=iron_cie, date=2016-04-02

/home/work/meetings/weekly_2015_08_10.doc
    -> author=secretary, doc_type=minutes, date=2015-08-10

/home/work/meetings/weekly_2016_08_17.doc
    -> author=secretary, doc_type=minutes, date=2016-08-17

/home/work/meetings/weekly_2016_08_24.doc
    -> author=secretary, doc_type=minutes, date=2016-08-24

/home/work/meetings/weekly_2017_08_31.doc
    -> author=secretary, doc_type=minutes, date=2017-08-31

/home/work/meetings/weekly_2017_08_31_group.jpg
    -> date=2017-08-31

/home/work/events/letter_host.doc
    -> author=me, doc_type=letter, recipient=ruby_hotel, date=2016-06-15

I wish to list all letters I wrote:

$ ~ > list_files author=me doc_type=letter
/home/work/projects/ginzoo2000/doc/letter_iron_cie.doc
/home/work/events/letter_host.doc

and list all meeting reports in 2016:

$ ~ > list_files doc_type=minutes date>=2016 date<2017
/home/work/meetings/weekly_2016_08_17.doc
/home/work/meetings/weekly_2016_08_24.doc

Ultimately I wish to browse files using metadata, independently of the underlying folder organization:

$ ~ > tree_view author doc_type
├── com_team
│   └── spreadsheet
│       └── survey.xls
├── me
│   ├── letter
│   │   ├── letter_host.doc
│   │   └── letter_iron_cie.doc
│   └── spreadsheet
│       └── survey.xls
├── secretary
│   └── minutes
│       ├── weekly_2015_08_10.doc
│       ├── weekly_2016_08_17.doc
│       ├── weekly_2016_08_24.doc
│       └── weekly_2017_08_31.doc
└── unsorted
    └── /home/work/meetings/weekly_2017_08_31_group.jpg

Rationale

The classical organization via static nested folders is becoming less efficient, first because the number of data files keeps increasing, and also because their content becomes more heterogeneous. Moreover, current file systems still lack a proper way of storing user-defined descriptors of files, i.e. metadata.

Very powerful third-party solutions exist, though, mostly relying on databases and often targeted at web applications. They add a layer on top of the file system, and the information is often enclosed in containers that are not human-readable (databases). They thus require a piece of software to access, modify and query information. For desktop environments, this matter is partially addressed by so-called "intelligent" assistants like Cortana for MS Windows. They parse all the data files and try to guess as much metadata as possible. They offer a basic keyword-based query engine that aims at being simple to use: the user simply types words describing the query, akin to a Google search. Some accomplished and efficient tools in this vein are Recoll and Beagle.

However, these systems do not let users supply their own metadata, and each has its own way of labeling things. One tool that answers this need is TagSpaces, which provides a comprehensive user interface but no console tools. An important shortcoming is that its metadata is not categorized: for example, there is no semantic difference between a tag indicating a project and a tag indicating a rating.

Aim

The proposed set of tools aims at a more efficient way of browsing files and intends to be:

  • user-driven: the user provides most of the meaningful metadata. Only limited automatic discovery is provided.
  • future-proof: rely on human-readable formats and store metadata next to the data. No information is lost if the tools are uninstalled.
  • as simple as possible: provide equivalents of the cd and ls commands that can query metadata.
  • more flexible: enable metadata-based file browsing. The user can create as many transversal views as needed with no data duplication. Each view covers specific metadata organized in a custom order.

Metadata definition

Metadata basically include descriptors stored by the file system (FS): size, creation/modification/access times, credentials, filetype ... and depend on the OS. Among these, here are the FS metadata that are automatically gathered:

  • file_type (either file or folder)
  • file_modification_date

Additional metadata are considered here to be user-driven, i.e. the user has control over metadata definition. Some automatic discovery tools could be used, but they are not meant to be part of the core tools and would be seen as helpers for the user to fill in large amounts of metadata. The choice of not relying on automatic discovery tools is made to avoid turning the mess generated by a large amount of data into a mess in the quantity and diversity of metadata.

Since the primary goal of using metadata here is to provide access to files by means of queries on descriptors, users are simply left to define their own system of descriptors. This may be seen as a rather selfish, user-limited view that could prevent easy sharing of metadata. But it is then the responsibility of a group of users or a community to agree on a standard metadata specification. Software brings molds, users bring dough.

In practice, metadata relate to files or folders on the drive and are stored in side-car files in the JSON format (see Metadata JSON file format). The name of each metadata file is the same as the associated file, with the extension .mdf. For example:

/home/me/personal/CV/cv_long_en.doc
/home/me/personal/CV/cv_long_en.doc.mdf
/home/me/personal/CV.mdf
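
The naming rule above can be sketched in a few lines of Python. This is only an illustration: the helper names `sidecar_path` and `described_path` are hypothetical and are not part of the tools described here.

```python
from pathlib import PurePosixPath

MDF_SUFFIX = ".mdf"  # side-car extension named in the text above


def sidecar_path(target):
    """Return the side-car metadata path for a file or folder path (hypothetical helper)."""
    target = PurePosixPath(target)
    # The extension is appended to the full name: cv.doc -> cv.doc.mdf
    return target.with_name(target.name + MDF_SUFFIX)


def described_path(sidecar):
    """Inverse mapping: strip the trailing .mdf to get the described item."""
    sidecar = PurePosixPath(sidecar)
    if sidecar.suffix != MDF_SUFFIX:
        raise ValueError("not a metadata file: %s" % sidecar)
    return sidecar.with_suffix("")


print(sidecar_path("/home/me/personal/CV/cv_long_en.doc"))
print(sidecar_path("/home/me/personal/CV"))
```

The same rule applies to files and folders, which is why the folder `CV` is described by `CV.mdf` sitting next to it rather than inside it.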

IMPORTANT: the naming consistency has to be maintained by the user (they have control ... and responsibility). Future improvements may rely on file content hashing and change monitoring via inchron as Beagle was doing, or fswatch.

Metadata are defined by a set of attribute/value pairs. For example:

  • project="ginzoo2000"
  • author=["me", "myself", "irene"]
  • date="#2016-05-24"

Attribute format

An attribute follows the python identifier format. Valid characters are the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.

Examples of valid attribute names:

doc_type, composer, author, keyword, project, protocol, reviewer, creation_date, guideline101
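
As a minimal sketch, the attribute naming rule can be checked with a regular expression. The function name `is_valid_attribute` is an assumption made for this example, not an existing API.

```python
import re

# Letters or underscore first, then letters, digits or underscores
ATTRIBUTE_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_attribute(name):
    """Return True if `name` follows the attribute format described above."""
    return ATTRIBUTE_RE.fullmatch(name) is not None


for candidate in ["doc_type", "guideline101", "101guideline", "doc type"]:
    print(candidate, is_valid_attribute(candidate))
```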

Attribute type

An attribute is associated with an array of values. The type of an attribute is determined by the common type of its associated values. It is resolved the first time an attribute is encountered. For other occurrences, values must be consistent with this type, else an error is produced. This applies to all values of a given attribute for all files or folders that have this attribute.

Value format

A value must only contain the following characters:
  • Alphanumerical characters: A to Z, a to z and 0 to 9
  • The hashtag # can only be used as first character, to indicate a date (see Date format)
  • Other allowed characters: _, -, +, :, . and @

Other characters (like space, &, >, etc.) are not allowed to avoid ambiguity with query operators and shell processing. Consider using underscores to replace spaces.

In general, metadata content is not meant to have a fancy form, but only to robustly represent semantics. It is advised to reuse metadata values as much as possible.

Formatting advice:
  • adopt singular everywhere (no plural).
  • use upper case only when necessary
  • minimize the number of words

Examples of valid values:

justin_time, kay_oss, jean-pierre_jeunot, 64.42, #2016-12-17T13:29, #2013-12, #2011
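
The character rules for values can be sketched the same way. The name `is_valid_value` is hypothetical, and the check covers the character set only: it does not verify that a value starting with # is a well-formed date.

```python
import re

# Optional leading '#', then one or more of the allowed characters
VALUE_RE = re.compile(r"#?[A-Za-z0-9_\-+:.@]+")


def is_valid_value(text):
    """Return True if `text` only uses the characters allowed for values."""
    return VALUE_RE.fullmatch(text) is not None


for candidate in ["jean-pierre_jeunot", "64.42", "#2013-12", "iron cie", "a&b"]:
    print(candidate, is_valid_value(candidate))
```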

Value types and casting

Supported metadata types for values are:

  • string
  • number
  • boolean (true|false)
  • date (see Date format)

All values for a given attribute must have the same type, across all files and folders. Else an error is produced.
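
One possible way to enforce this consistency is to record the type of each attribute the first time it is seen and to reject later values of another type. The sketch below assumes values have already been loaded from JSON. `TypeRegistry`, `value_type` and `AttributeTypeError` are illustrative names, not the actual implementation.

```python
class AttributeTypeError(ValueError):
    """Raised when a value does not match the type already seen for its attribute."""


def value_type(value):
    """Map a JSON-decoded value to one of the supported metadata types."""
    # bool is tested before number because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "date" if value.startswith("#") else "string"
    raise AttributeTypeError("unsupported value: %r" % (value,))


class TypeRegistry:
    """Remember the type of each attribute across all files and folders."""

    def __init__(self):
        self.types = {}

    def check(self, attribute, values):
        for value in values:
            found = value_type(value)
            # The first occurrence of an attribute fixes its type
            expected = self.types.setdefault(attribute, found)
            if found != expected:
                raise AttributeTypeError(
                    "%s: expected %s, got %s (%r)" % (attribute, expected, found, value)
                )


registry = TypeRegistry()
registry.check("nb_pages", [4, 8])    # resolves nb_pages to "number"
registry.check("nb_pages", [45.6])    # consistent, accepted
```

Calling `registry.check("nb_pages", ["ten"])` afterwards would raise `AttributeTypeError`, since `nb_pages` was first resolved as a number.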

Date format

The date format is ISO 8601: [+-]YYYY-MM-DDThh:mm:ss[Z|+hh:mm]. The implementation provided by the Python iso8601 module is used, although the space character is not allowed as a separator between date and time (T is used instead). It is highly recommended to use fully qualified dates as much as possible (with at least year/month/day). Otherwise, each missing field defaults to its first value (first month, first day, hour 0, minute 0, second 0). Example: 2019-02 is interpreted as 2019-02-01T00:00:00
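
The defaulting rule for partially qualified dates can be illustrated with the standard library alone. This sketch is a simplified stand-in, not the iso8601 module mentioned above: it ignores the sign and the time zone suffix, and `parse_date` is a hypothetical name.

```python
import re
from datetime import datetime

# Year is mandatory; every later field is optional and nested in the previous one
PARTIAL_DATE_RE = re.compile(
    r"(\d{4})(?:-(\d{2})(?:-(\d{2})"
    r"(?:T(\d{2})(?::(\d{2})(?::(\d{2}))?)?)?)?)?"
)
# Defaults for (year, month, day, hour, minute, second); year never needs one
DEFAULTS = (None, 1, 1, 0, 0, 0)


def parse_date(text):
    """Parse a possibly partial date, filling missing fields with their first value."""
    if text.startswith("#"):
        text = text[1:]  # drop the marker that flags date values
    match = PARTIAL_DATE_RE.fullmatch(text)
    if match is None:
        raise ValueError("unsupported date: %r" % text)
    fields = [
        int(group) if group is not None else default
        for group, default in zip(match.groups(), DEFAULTS)
    ]
    return datetime(*fields)


print(parse_date("2019-02").isoformat())
```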

Metadata JSON file format

Metadata are stored in JSON files (.mdf extension) containing mappings between unique attributes (or categories) and values. Values are in an array of homogeneous type (string, number, boolean):

{
  "attribute_with_one_string_value" : ["value_string_1"],
  "attribute_with_multiple_string_values" : ["value_string_2", "value_string_3"],
  "attribute_with_one_numerical_value": [45.6],
  "attribute_with_numerical_values": [4, 8, 15, 16, 23, 42],
  "attribute_with_boolean_value": [false],
  "attribute_with_string_date": ["#2015-06-04"]
}

Note that for a given attribute and a given file or folder, associated values should also be unique. If not, duplicates are ignored anyway.

The empty string is ignored for attributes and values (the user is warned).
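
A minimal sketch of how such a file could be read while applying the two rules above (duplicates ignored, empty strings skipped with a warning). It only uses the standard `json` and `warnings` modules. `load_metadata` is a name chosen for this example and takes the file content as a string.

```python
import json
import warnings


def load_metadata(text):
    """Decode .mdf content, dropping empty strings and duplicate values."""
    cleaned = {}
    for attribute, values in json.loads(text).items():
        if attribute == "":
            warnings.warn("empty attribute name ignored")
            continue
        unique = []
        for value in values:
            if value == "":
                warnings.warn("empty value ignored for %s" % attribute)
            elif value not in unique:
                # Keep the first occurrence only, preserving order
                unique.append(value)
        cleaned[attribute] = unique
    return cleaned


content = '{"author": ["me", "irene", "me"], "nb_pages": [12]}'
print(load_metadata(content))
```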

Reserved attributes used to store metadata from the file system are:
  • file_type (string): either 'file' or 'folder'. TODO: how to handle symlinks? (unix only)
  • file_modification_date (date): when the file content was last modified.
  • file_access_date (date): time of most recent access.

Commands

  • lsx: list files by querying metadata
  • cdx: change directory by querying metadata
  • treex: show a tree view based on given metadata attributes

Query format

Queries are logical conjunctions (logical AND) of predicates, separated by spaces. A predicate can be in two forms:

  1. <attribute_name><operator><qvalue>
    Select items where attribute_name matches the given constraint for any of its associated values. Examples: author=mister_tea, date<2016. qvalue must be convertible to the type of the given attribute.
  2. [<negation>]<qvalue>
    Without negation: select all items where any value of any attribute with string type is equal to qvalue. With negation: select all items where all values of any attribute with string type are not equal to qvalue. Note that qvalue is always interpreted as a string here. Indexed values are converted to string before comparing to qvalue.
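
The two predicate forms can be sketched as two small functions working on the metadata of one item, represented here as a plain dict mapping attributes to lists of values. This is an assumption-laden illustration: `matches_attribute` and `matches_value` are hypothetical names, and qvalue is assumed to be already converted to the attribute type for the first form.

```python
import operator

OPERATORS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def matches_attribute(metadata, attribute, op, qvalue):
    """Form 1: true if ANY value of `attribute` satisfies the constraint."""
    compare = OPERATORS[op]
    return any(compare(value, qvalue) for value in metadata.get(attribute, []))


def matches_value(metadata, qvalue, negated=False):
    """Form 2: look for qvalue among the values of string-typed attributes."""
    found = any(
        value == qvalue
        for values in metadata.values()
        for value in values
        if isinstance(value, str)
    )
    # With negation, the item matches only if no string value equals qvalue
    return not found if negated else found


survey = {"author": ["com_team", "me"], "doc_type": ["spreadsheet"], "nb_pages": [12]}
# A query is a conjunction: every predicate must hold
selected = (
    matches_attribute(survey, "author", "=", "me")
    and matches_attribute(survey, "nb_pages", "<=", 50)
    and matches_value(survey, "letter", negated=True)
)
print(selected)
```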

Operators

The negation character ! used immediately before a string is unary and relates to the value immediately following it. It cannot be used before attribute names.

Examples of valid negations:

!felix_cited
!docx

Examples of invalid usage of !:

felix_cited!
felix_!cited

The equality operator = is binary and relates to the attribute name preceding it (left operand) and the value following it (right operand).

Examples of valid equalities:

author=felix_cited
doc_type=letter

Examples of invalid equalities:

author= felix_cited
author = felix_cited
author =felix_cited
=author

The non-equality operator != is binary and relates to the attribute name preceding it (left operand) and the value following it (right operand).

Examples of valid non-equalities:

author!=felix_cited
reviewed!=true

The relational operators <, <=, >, >= are binary and relate to the attribute name preceding them and the value following them.

Examples of valid usage of relational operators:

nb_pages<=50
temperature_celsius<37.2
creation_date>=2016-07-01 creation_date<2016-09-01
author_name>=joh

Examples of invalid usage of relational operators:

nb_pages <= 50
nb_pages<= 50
nb_pages <=50
nb_pages<=fifty
2016-09-01<creation_date
author_name=<c
<nb_pages
nb_pages>
2016-09-01<creation_date<2016-07-01
creation_date=june
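
The syntax rules above can be summarized by a small predicate parser. The sketch below is one possible reading of these rules, with a hypothetical `parse_predicate` function. It checks syntax only, so a predicate such as nb_pages<=fifty is accepted here and would be rejected later, when the value is converted to the attribute type.

```python
import re

VALUE = r"#?[A-Za-z0-9_\-+:.@]+"
# Two-character operators are listed first so that "<=" is not read as "<"
ATTRIBUTE_PREDICATE_RE = re.compile(
    r"(?P<attribute>[A-Za-z_][A-Za-z0-9_]*)"
    r"(?P<operator><=|>=|!=|=|<|>)"
    r"(?P<qvalue>" + VALUE + r")"
)
VALUE_PREDICATE_RE = re.compile(r"(?P<negation>!?)(?P<qvalue>" + VALUE + r")")


def parse_predicate(text):
    """Return (attribute, operator, qvalue); attribute is None for bare values."""
    match = ATTRIBUTE_PREDICATE_RE.fullmatch(text)
    if match:
        return (match["attribute"], match["operator"], match["qvalue"])
    match = VALUE_PREDICATE_RE.fullmatch(text)
    if match:
        return (None, match["negation"], match["qvalue"])
    raise ValueError("invalid predicate: %r" % text)


def parse_query(query):
    """Split a query on spaces and parse each predicate."""
    return [parse_predicate(token) for token in query.split()]


print(parse_query("author=felix_cited nb_pages<=50 !docx"))
```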

Querying dates

Dates follow ISO 8601. Important: dates that are not fully qualified are set to the first matching month / day / hour / minute / second. For example, "#2015" is interpreted as "#2015-01-01T00:00:00" and "#2015-07" as "#2015-07-01T00:00:00". Hence the query date>#2015 does not mean "any date strictly after the year 2015" (i.e. 2016, 2017...), but "any date strictly after January 1st 2015". The dates "#2015-04-01" and "#2015-01-01T00:00:01" will match this query.

If one wants to actually get entries where date is strictly after 2015, one should use date>=#2016. In general, it is often misleading to use strict comparisons for dates.
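
The difference between the two queries can be checked with plain `datetime` comparisons. The sample dates below are made up for the illustration, and the two bounds are written out by hand as the values that "#2015" and "#2016" resolve to.

```python
from datetime import datetime

bound_2015 = datetime(2015, 1, 1)  # what "#2015" resolves to
bound_2016 = datetime(2016, 1, 1)  # what "#2016" resolves to

dates = [
    datetime(2015, 1, 1),
    datetime(2015, 1, 1, 0, 0, 1),
    datetime(2015, 4, 1),
    datetime(2016, 1, 1),
    datetime(2016, 3, 5),
]

# date>#2015 keeps everything after the first instant of 2015
strictly_after_bound = [d for d in dates if d > bound_2015]
# date>=#2016 keeps only the dates that fall after the year 2015
after_year_2015 = [d for d in dates if d >= bound_2016]

print(len(strictly_after_bound), len(after_year_2015))
```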

cdx

Change directory by selecting a folder based on metadata query.

usage: cdx PREDICATE1 [PREDICATE2 ...]

If the set of given predicates yields a unique directory, then cd to it. Otherwise, the available choices are simply displayed.
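
The decision rule can be sketched as follows, assuming the query has already produced a list of matching directories. `resolve_directory` is a hypothetical helper. Since a child process cannot change the working directory of the calling shell, an actual cdx would presumably be wrapped in a shell function that runs cd on the returned path.

```python
def resolve_directory(candidates):
    """Return the single matching directory, or None after listing the choices."""
    unique = sorted(set(candidates))
    if len(unique) == 1:
        return unique[0]
    # Zero or several matches: show what is available and change nothing
    for path in unique:
        print(path)
    return None


target = resolve_directory(["/home/work/projects/ginzoo2000"])
print(target)
```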

treex

Show a tree view of files and folders according to the given attributes.

usage: treex ATTRIBUTE1 [ATTRIBUTE2 ...] [--show_unsorted]

Files or folders actually having the given attributes will be displayed. Each layer of the tree corresponds to a given attribute. If the option --show_unsorted is given, then all other files and folders that don't have one of the given attributes will be displayed in an "unsorted" section at the end.
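
A minimal sketch of how such a view could be built from an index, represented here as a dict mapping each path to its metadata. This is not the medinx implementation: the function below is a standalone illustration, and the reserved "unsorted" key is an assumption of the example. An item with several values for an attribute appears under each of them.

```python
from itertools import product


def tree_view(index, attributes, show_unsorted=False):
    """Group paths in nested dicts, one level per attribute."""
    if not attributes:
        raise ValueError("at least one attribute is required")
    tree = {}
    unsorted = []
    for path, metadata in sorted(index.items()):
        if any(attribute not in metadata for attribute in attributes):
            unsorted.append(path)
            continue
        # One branch per combination of values, e.g. (author, doc_type)
        for combination in product(*(metadata[a] for a in attributes)):
            node = tree
            for value in combination[:-1]:
                node = node.setdefault(value, {})
            node.setdefault(combination[-1], []).append(path)
    if show_unsorted and unsorted:
        tree["unsorted"] = unsorted
    return tree


index = {
    "/home/work/events/letter_host.doc": {"author": ["me"], "doc_type": ["letter"]},
    "/home/work/projects/ginzoo2000/data/survey.xls": {
        "author": ["com_team", "me"], "doc_type": ["spreadsheet"]},
    "/home/work/meetings/weekly_2017_08_31_group.jpg": {"date": ["#2017-08-31"]},
}
view = tree_view(index, ["author", "doc_type"], show_unsorted=True)
print(sorted(view))
```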

Implementation

import medinx

# Parse .mdf files in a given directory and its subdirectories
full_index = medinx.parse_folder('.')

# Start a selection
selection = full_index.filter('author=me')
# Refine selection
selection.filter('doc_type=letter')
# Finally gather files in the selection
selected_files = selection.get_files()

# Start another selection, for folders
selection = full_index.filter('file_type=folder')
# Filter by value (any attribute)
selection.filter('ginzoo2000')
selected_folders = selection.get_files()

# Build a tree view
view = full_index.tree_view(['author', 'doc_type'])
for author_name, author_view in view.iteritems():
    print('author: %s' % author_name)
    for doc_type, doc_fns in author_view.iteritems():
        print('    * doc type: %s' % doc_type)
        for doc_fn in doc_fns:
            print('        - %s' % doc_fn)