Version: 1.0.0-beta Author: Ismael Correa Castro (ORCID 0009-0007-3815-7053) Last updated: 2023-07-31
DNA is a file format for describing DNA. It can store the genetic information of any natural organism on Earth, whether animal, plant, fungus or microorganism. It also allows extraterrestrial nitrogenous bases, so that extraterrestrial organisms can be described if scientists ever find one. In addition, it accepts any custom nitrogenous base, so the DNA of any fantasy organism, such as a fantasy animal or a fantasy plant, can be described with this file format as well.
The DNA file format is based on XML and uses the extension .dna. It's part of the Scifir Collection, a collection of technologies, including libraries and file formats, that allows scientists to create scientific software and scientific machines. You can see more technologies of the Scifir Collection on the GitHub page of Scifir.
A dna file is an archive file containing the following files:
- sequence.dnal or sequence.dnac file: It contains the whole DNA sequence. The dnal file format is lighter than the dnac file format.
- info.xml file: It contains the metadata associated with the genetic information, such as the species of the life form, its name, the authors and the organization (if there is one), among other data.
- sequence.sha256 file: A checksum that makes it possible to detect any alteration of the sequence.dnal or sequence.dnac file.
- info.sha256 file: A checksum that makes it possible to detect any alteration of the info.xml file.
- Very light: A dna file that uses a dnal file is by far lighter than other DNA file formats, with a size in the order of MB rather than GB. A dna file that uses a dnac file has a size in the order of GB, as is usual in other file formats.
- Very structured: Individual genes can be changed while keeping track of each change made, rather than editing one large DNA sequence. In addition to changing genes, it's easy to add chromosomes.
- Artificial nitrogenous bases: Artificial nitrogenous bases can be added in order to support artificial DNAs. Artificial DNA is a concept defined in Scifir to refer to DNA created artificially with characteristics that extend those of natural DNA. A natural DNA modified artificially by any means (inside a computer or in the laboratory with injections) is not considered an artificial DNA. A DNA is artificial only if it has new nitrogenous bases that are not present in the natural world.
- Extraterrestrial nitrogenous bases: Extraterrestrial nitrogenous bases can be added to support any extraterrestrial life form present in the universe, which may be found in the not-so-distant future.
- Metadata: The name of the life form, the name of the species, the authors and the organization, among other metadata, can all be added here.
An example of a dnal file, the more commonly used of the two sequence file formats, is the following:
<?xml version="1.0" encoding="UTF-8"?>
<dna>
<chromosome name="1">
<gene name="human:(hgnc) black_hair"></gene>
<non_coding name="human:(hgnc) non_coding1"></non_coding>
<gene name="human:(hgnc) brown_eyes"></gene>
<gene>ATCGAT</gene>
</chromosome>
</dna>
An example of a dnac file is the following:
<?xml version="1.0" encoding="UTF-8"?>
<dna>
<chromosome name="1">
<gene>TGCAATCGAG</gene>
<non_coding>TAACTAAG</non_coding>
<gene>ATCGAT</gene>
</chromosome>
</dna>
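As a minimal sketch of how such a file could be read, the following Python code walks the dnac example above with the standard library XML parser and lists each gene and non-coding region in order. The function name `read_regions` and the tuple layout are assumptions made for illustration and are not part of the format.

```python
import xml.etree.ElementTree as ET

# The dnac example from this page, embedded so the sketch is self-contained.
DNAC_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<dna>
<chromosome name="1">
<gene>TGCAATCGAG</gene>
<non_coding>TAACTAAG</non_coding>
<gene>ATCGAT</gene>
</chromosome>
</dna>"""


def read_regions(xml_text):
    """Return (container, kind, name, sequence) tuples in document order.

    Hypothetical helper: 'container' is the chromosome name, or the tag
    name for <mtdna> and <cpdna>, which have no documented name attribute.
    """
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != "dna":
        raise ValueError("top-level element must be <dna>")
    regions = []
    for container in root:
        if container.tag not in ("chromosome", "mtdna", "cpdna"):
            continue
        label = container.get("name", container.tag)
        for region in container:
            if region.tag in ("gene", "non_coding"):
                sequence = (region.text or "").strip()
                regions.append((label, region.tag, region.get("name"), sequence))
    return regions


for label, kind, name, sequence in read_regions(DNAC_TEXT):
    print(label, kind, name, sequence)
```

The same function also accepts the dnal example, in which case the `name` field carries the gene reference and the sequence is empty for named genes.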
An example of an info.xml file inside a .dna file is the following:
<?xml version="1.0" encoding="UTF-8"?>
<info>
<name>Dogo</name>
<species>dog</species>
<authors>Ismael Correa Castro</authors>
<date>2023-07-20</date>
<description>Very energetic dog.</description>
<organization independent="true"></organization>
</info>
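A minimal sketch of reading this metadata in Python follows. It uses only the standard library. The function name `read_info`, the returned dictionary layout and the choice to raise `ValueError` when an element marked as required in the element table is missing are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The info.xml example from this page, embedded so the sketch is self-contained.
INFO_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<info>
<name>Dogo</name>
<species>dog</species>
<authors>Ismael Correa Castro</authors>
<date>2023-07-20</date>
<description>Very energetic dog.</description>
<organization independent="true"></organization>
</info>"""

# Elements listed as "Required" in the info.xml element table.
REQUIRED = ("authors", "date", "organization")


def read_info(xml_text):
    """Return the metadata as a dict (hypothetical layout, not part of the format)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag != "info":
        raise ValueError("top-level element must be <info>")
    missing = [tag for tag in REQUIRED if root.find(tag) is None]
    if missing:
        raise ValueError("missing required elements: " + ", ".join(missing))
    info = {"data": []}
    for child in root:
        text = (child.text or "").strip()
        if child.tag == "organization":
            independent = child.get("independent") == "true"
            info["independent"] = independent
            info["organization"] = None if independent else text
        elif child.tag == "data":
            info["data"].append(text)  # <data> may appear any number of times
        else:
            info[child.tag] = text
    return info


print(read_info(INFO_TEXT))
```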
The checksum files sequence.sha256 and info.sha256 each correspond to a file generated by sha256sum.
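As a rough sketch, a checksum in the usual sha256sum text layout (hexadecimal digest, two spaces, file name) can be produced and verified in Python with `hashlib`. The helper names below are hypothetical, and the sketch assumes the checksum file contains a single line in that layout.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def make_checksum_line(data: bytes, filename: str) -> str:
    """Build one line in the text-mode layout printed by sha256sum."""
    return f"{sha256_hex(data)}  {filename}\n"


def verify_checksum(data: bytes, checksum_text: str) -> bool:
    """Compare data against the digest stored at the start of checksum_text."""
    fields = checksum_text.split()
    if not fields:
        return False  # empty checksum file: nothing to compare against
    return sha256_hex(data) == fields[0].lower()


# Usage with an in-memory payload standing in for sequence.dnac.
payload = b"<dna></dna>"
line = make_checksum_line(payload, "sequence.dnac")
print(verify_checksum(payload, line))
```

A checksum of this kind detects accidental or unannounced changes. It does not by itself prove who created the file.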
The DNA file format was created by Ismael Correa Castro, a 32-year-old industrial civil engineer and scientist. You can email him at [email protected] if you find bugs, want to request new features, or have any other need. His ORCID is 0009-0007-3815-7053, in case you want to reference this work in a publication.
The Scifir Foundation is looking for funding, in order to do some digital marketing and cover other needs of its projects. If you want to support its technologies, and science will thank you for that, you can donate on this sponsors page.
dnal and dnac files have the following elements:
XML element | Use | Description |
---|---|---|
<dna> | Required, top level | Top-level element to represent dna |
<chromosome> | Required, any number | Adds a chromosome |
<mtdna> | Optional, required for animal DNAs | Adds a mitochondrial DNA |
<cpdna> | Optional, required for plant DNAs | Adds a chloroplast DNA |
<gene> | Required, any number | Adds a gene sequence |
<non_coding> | Required, any number | Adds a non-coding sequence |
It's mandatory to add the <mtdna> element in dna files of animals, and to add the <cpdna> element in dna files of plants.
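The rule above can be sketched as a small check in Python. The format does not say where the kingdom of the life form is stored, so the `kingdom` argument is an assumption supplied by the caller, and the function name `check_organelles` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an animal sequence with one chromosome and one <mtdna>.
ANIMAL_DNA = (
    "<dna>"
    "<chromosome name='1'><gene>ATCGAT</gene></chromosome>"
    "<mtdna><gene>TGCA</gene></mtdna>"
    "</dna>"
)


def check_organelles(xml_text, kingdom):
    """Return a list of problems found. An empty list means the rule holds.

    kingdom is "animal", "plant" or anything else (no organelle requirement).
    """
    root = ET.fromstring(xml_text)
    problems = []
    if len(root.findall("mtdna")) > 1:
        problems.append("only one <mtdna> element is allowed")
    if kingdom == "animal" and root.find("mtdna") is None:
        problems.append("animal dna files require an <mtdna> element")
    if kingdom == "plant" and root.find("cpdna") is None:
        problems.append("plant dna files require a <cpdna> element")
    return problems


print(check_organelles(ANIMAL_DNA, "animal"))
print(check_organelles(ANIMAL_DNA, "plant"))
```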
info.xml files have the following elements:
XML element | Use | Description |
---|---|---|
<info> | Required, top level | Top level element to represent metadata of a DNA |
<name> | Optional | Name of the life form |
<species> | Optional | Name of the species of the life form |
<authors> | Required | Name of each of the authors who sequenced and/or edited the DNA |
<date> | Required | Date of creation of the file |
<description> | Optional | Any relevant description of the life form |
<organization> | Required | Organization the authors were working for |
<data> | Optional, any number | Any additional data |
The <dna> element is the top level element of dnal and dnac files. It contains all the other elements.
The <chromosome> element represents a chromosome. It contains <gene> and <non_coding> elements in any number.
It has the following attribute:
Attribute | Required | Description |
---|---|---|
name | Required | Name of the chromosome to identify it from others |
The <mtdna> element represents a mitochondrial DNA. Like <chromosome>, it contains <gene> and <non_coding> elements in any number. There can be only one <mtdna> element in a DNA file, and it's optional, because dna files of plants don't include mitochondrial DNA.
The <cpdna> element represents a chloroplast DNA. It's similar to the <mtdna> element, and contains <gene> and <non_coding> elements in any number. It's optional, because dna files of animals don't include chloroplast DNA.
The <gene> element represents a gene. It contains the sequence of nitrogenous bases the gene is composed of. The nitrogenous bases are written in lower case letters if they are not methylated, and in upper case letters if they are methylated. It's very important to use lower case and upper case letters appropriately, because the methylation of nitrogenous bases can seriously change the behavior of part of the DNA.
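The case convention described above can be illustrated with a short Python sketch that reports which bases of a sequence are marked as methylated. The function names and the sample sequence are hypothetical. The sketch assumes each base is written as a single letter.

```python
def methylated_positions(sequence):
    """Return the 0-based positions of bases written in upper case (methylated)."""
    return [i for i, base in enumerate(sequence) if base.isupper()]


def methylation_fraction(sequence):
    """Return the share of bases marked as methylated, or 0.0 for an empty sequence."""
    bases = [base for base in sequence if base.isalpha()]
    if not bases:
        return 0.0
    return sum(base.isupper() for base in bases) / len(bases)


# Hypothetical sequence: the third and sixth bases are methylated.
sample = "atCgaT"
print(methylated_positions(sample))
print(methylation_fraction(sample))
```

Because case carries meaning here, any tool that normalizes sequences to a single case would silently discard the methylation information.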
According to the theory of codons, genes should always start with AUG, so any <gene> element that doesn't start with that sequence would be expected to be wrong. However, in order to support any possible biological case, starting with AUG is not enforced.
The <non_coding> element represents a non-coding region of the DNA. Non-coding regions usually comprise the majority of the nitrogenous bases in any DNA file, because the DNA of life forms is mostly non-coding. They are usually edited in small portions, specifically in the portions of the transcription factors, to change the level of expression of a specific gene. That gene is usually the one that comes next (if the region of the transcription factor is near the end of the non-coding region) or the one that comes before (if the region of the transcription factor is near the start of the non-coding region).
The <info> element is the top level element of the info.xml file. It contains all the other elements of info.xml, which together hold all the metadata associated with the dna.
The <name> element is the name of the life form in real life. It's intended to be used for animals, although if a plant or a microorganism has a name, whether a normal name or a scientific name following some set of rules, it can be added for those kingdoms too. For pets and for animals you experiment with at home or inside an organization, it corresponds to the name you use to designate them.
The <species> element is the species of the life form. It's always written as the scientific name, in binomial form, as for example "drosophila-melanogaster". Any other nomenclature should be avoided in favor of the binomial name, but is not forbidden, because the binomial name may not be suitable for some purpose, and other nomenclatures can then be used in rare exceptional cases.
The <authors> element lists all the authors of the dna file. An author is a person who has sequenced the DNA with a DNA sequencer, or a person who has created a new dna file from an existing one by editing it with any program.
All the authors are listed one after the other, separated by commas. The convention for writing the names of scientists is to write the name in the original language, with the letters used in it. If the name in the original language doesn't use Latin letters, the name in Latin letters is written in parentheses. After that comes the national identification number of the scientist (which differs depending on their country), and then an identifier for scientists or researchers of any kind, like ORCID, can be added. Any number of identifiers can be added in this way.
An example of an authors element is the following:
<?xml version="1.0" encoding="UTF-8"?>
<authors>Ismael Correa Castro (RUT 17.705.429-1,ORCID 0009-0007-3815-7053)</authors>
Apart from ORCID, any other identifier of the same type can be added, as long as its name is written too, in order to know which identification system it belongs to.
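The following Python sketch shows one possible way to split an authors string into names and identifiers. It follows the layout of the example above, where the identifiers of an author sit inside one pair of parentheses and each is written as a system name followed by a value. A name transliterated in parentheses is not handled. The function names and the returned dictionary layout are assumptions made for illustration.

```python
def split_authors(text):
    """Split an authors string on the commas that are outside parentheses."""
    authors, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == "," and depth == 0:
            authors.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    authors.append("".join(current).strip())
    return [author for author in authors if author]


def parse_author(entry):
    """Return {'name': ..., 'identifiers': {system: value}} for one author entry."""
    name, _, rest = entry.partition("(")
    identifiers = {}
    for item in rest.rstrip(") ").split(","):
        system, _, value = item.strip().partition(" ")
        if system and value:
            identifiers[system] = value.strip()
    return {"name": name.strip(), "identifiers": identifiers}


text = "Ismael Correa Castro (RUT 17.705.429-1,ORCID 0009-0007-3815-7053)"
for entry in split_authors(text):
    print(parse_author(entry))
```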
The <date> element specifies the day of creation of the dna file. Only the day is specified, not the hour, because the intention is not to keep statistics about the creation of dna files (separate software can do that if a laboratory needs it) but to know the day for any purpose. It can be useful, for example, to remember when a specific project started.
The <description> element describes the life form. A description has to be as objective as possible, and can include anything that needs to be explained. Everything useful to explain, everything that's curious, any anomaly, and every other important event related to the life form or to the dna itself should be explained here.
The <organization> element is the name of the organization the scientists were working for when they created the dna file. If they worked independently, the attribute independent="true" is used instead of the name.
Attribute | Required | Description |
---|---|---|
independent | Optional | It's set to true if there's no organization and the scientists have worked independently |