User:Rocketshiporion/File Backup, Indexing & Deduplication Program


Storage format

I think the most important decision at this early stage is the storage format for the meta data. Storage formats are hard to change once they are in use, due to backward compatibility.

Ideally the storage format should be easy to access from different programming languages, easy to read manually, compact, and able to support high-performance searching in the data. I also think it should be easy to merge data from two sources and treat them as one data set. The raw amount of meta data is hard to estimate; in a typical home-use scenario for photos, documents and so on it can be estimated as:

  • Bytes per file observation on average (path, storage media name, file name, other meta data): 200
  • Number of files in each indexing operation: 10 000
  • Number of indexing operations per year: 50
  • Total amount of meta data: 200*10 000*50 = 100 MB/year

If used for all system files and on many computers in a company every night, the total amount of data will be much larger. I think it will be necessary to remove some old meta data to keep the size of the meta data down.

One way of minimizing the amount of meta data is to not store all meta data for every file every time it is indexed; instead, each record could reference earlier records by ID number and only note the things that have changed. The main problem with this is the increased complexity and increased risk of bugs. In particular, merging two data sets becomes complicated, since the ID numbers of the records are not guaranteed to be unique across multiple independent data sets. For these reasons I think I prefer the simple solution of storing all meta data independently for each observation.

I think the system should be built in such a way that it is never necessary to load all the meta data to RAM.

The formats I have considered are SQLite and JSON (an earlier prototype, shown below, uses JSON).

At the moment I prefer SQLite. The drawback is that it cannot be inspected with a text editor, but there are other tools that allow inspection of the data. The advantages are that it is easier to search in the data and that it is easy to extend the file format with new tables and columns.
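
For illustration, here is a minimal sketch (not a final schema; the file and column names are only placeholders) of what working with such an SQLite meta data file could look like from Python with the standard sqlite3 module:

import sqlite3

# Open (or create) the meta data file; most languages have similar SQLite bindings.
con = sqlite3.connect("meta.db")

# Creating the initial schema, and extending it later, are both plain SQL statements.
con.execute("""CREATE TABLE IF NOT EXISTS fileobs (
                   device   TEXT NOT NULL,
                   fname    TEXT NOT NULL,
                   obs_time TEXT NOT NULL,
                   size     INTEGER,
                   hash     TEXT)""")
# Extending the format with a new column later is a one-line change:
# con.execute("ALTER TABLE fileobs ADD COLUMN tags TEXT")

# Searching is done by the database engine and does not require loading all meta data into RAM.
for device, fname, size in con.execute(
        "SELECT device, fname, size FROM fileobs WHERE size > ?", (10**6,)):
    print(fname)
con.close()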

I have earlier made a partly implemented prototype using JSON; each indexing operation generates a JSON file like this small example:

{
    "files": {
        "\\LAB_TREFASSYSTEM.doc": {
            "sha256": "vRrqir9HTR7SvdahrTIUxuZGStEQ0NxsdU7FY_CXyTU=", 
            "ctime": "2009-10-17T12:52:19.000000Z", 
            "is_regular": true, 
            "mode": 438, 
            "is_symlink": false, 
            "mtime": "2004-11-08T15:06:00.000000Z", 
            "atime": "2009-10-17T00:00:00.000000Z", 
            "size": 1608704
        }, 
        "\\lab2_3.xls": {
            "sha256": "-BeRAD6fmvxJrEc6yE1DCuRR9fMctQv4AQ1R9Uf0RgQ=", 
            "ctime": "2009-10-17T12:52:18.820000Z", 
            "is_regular": true, 
            "mode": 438, 
            "is_symlink": false, 
            "mtime": "2004-11-26T18:47:00.000000Z", 
            "atime": "2009-10-17T00:00:00.000000Z", 
            "size": 18944
        }, 
        "\\Lab2_komplement_gen.doc": {
            "sha256": "N6fClPi7BmzbdjQMQ3JIJd76NN0OA8N4vCN1LuCb7HE=", 
            "ctime": "2009-10-17T12:52:18.900000Z", 
            "is_regular": true, 
            "mode": 438, 
            "is_symlink": false, 
            "mtime": "2004-11-26T19:39:00.000000Z", 
            "atime": "2009-10-17T00:00:00.000000Z", 
            "size": 21504
        }
    }, 
    "versions": {
        "python_implementation": "CPython", 
        "generating_program": "search_files", 
        "os": "Windows-XP-5.1.2600-SP3", 
        "python_version": "2.6.1", 
        "program_version": "0.0.2"
    }, 
    "dir": ".\\tmp\\ElDRIV_LAB", 
    "options": {
        "verbose": false, 
        "tags": null, 
        "filename": "./file_lists/C_DS_Lars_dokument_styrgryp_zip___eldriv_lab", 
        "device": "", 
        "message": "", 
        "sha256": true
    }, 
    "utc_time": "2009-10-17T10:54:55.203000Z", 
    "host_name": "Ferarri3400Lars"
}

The drawback is that after a while there will be many such files and it will take time to scan through them when searching. A benefit is that the user can use Explorer or a similar tool to remove unwanted meta data.
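
As a rough sketch, searching across such per-operation JSON files could look like the code below; the directory layout and the *.json file-name pattern are assumptions, not something the prototype actually uses:

import glob
import json

def find_file(meta_dir, wanted_name):
    """Scan every JSON index file in meta_dir for entries whose path contains wanted_name."""
    for path in glob.glob(meta_dir + "/*.json"):      # one file per indexing operation
        with open(path) as f:
            index = json.load(f)                      # the whole file must be parsed every time
        for fname, info in index["files"].items():
            if wanted_name in fname:
                yield path, fname, info["sha256"], info["size"]

for hit in find_file("./file_lists", "lab2_3.xls"):
    print(hit)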


Although I'm not familiar with JSON, I think the significant drawback (based on this example) is that the number of entries that would accumulate from an observation may make it quite cumbersome for the user to actually remove unwanted metadata using an explorer. If e.g. 5000 files are indexed in an observation, the resulting metadata file would be perhaps a megabyte or more in size, and run to around 50,000 lines. I would go with an SQL-based database table (see below), as it could be queried with a program such as OpenOffice Base.
CREATE TABLE "table_name" (
"FileName" VARCHAR NOT NULL, --the name of the file
"SHA256" VARCHAR NOT NULL, --the SHA256 hash of the file
"FileSize" INT NOT NULL, --size of the file in bytes
"CrnTime" DATETIME2 NOT NULL, --date & time the file was created
"ModTime" DATETIME2 NOT NULL, --date & time the file was last modified
"AcsTime" DATETIME2 NOT NULL); --date & time the file was last accessed
I'm not sure about what to use as the primary key though. Rocketshiporion 01:51, 22 December 2010 (UTC)
What I meant by "removing unwanted metadata" was that if each indexing operation generates one file of meta data, then the user could remove all the meta data from one indexing operation at once. For example, if the user indexes and backs up the files each week, then after a year the user could keep only one file per month and remove the rest, or similar.
Also, if the user wants to look at the contents of a selection of DVD-Rs and other media, the user could place the meta data files from them in a folder and ask the program to look for them there. I did not intend that the user should edit the individual meta data files. Of course nothing prevents the meta data from being stored in an SQL database such as SQLite with one file for each indexing operation, but I think that in order to utilize the full power of SQL, all data should be stored in one database (file). SQLite can open up to about 10 database files as one merged database, but 10 is too limited for this application.
I think your SQL example is too restricted; there is no place to store the date of the indexing operation and similar fields. I think at least one more table is needed for such "meta meta data" (the data outside the files section in the JSON example).
The core of my SQL example (see Data model), the table fileobs, is rather similar to your SQL example, but I have added a number of tables for other functions that maybe make it too complex. --Gr8xoz (talk) 13:38, 23 December 2010 (UTC)
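
For reference, this is roughly what the ATTACH approach mentioned above looks like from Python; the file names are invented, and by default SQLite only allows around ten attached databases:

import sqlite3

con = sqlite3.connect("index_2010-12-01.db")                  # one indexing operation
con.execute("ATTACH DATABASE 'index_2010-12-08.db' AS wk2")   # a second one, queried as wk2.*

# A query spanning both data sets:
rows = con.execute("""SELECT device, fname, obs_time FROM fileobs
                      UNION ALL
                      SELECT device, fname, obs_time FROM wk2.fileobs""").fetchall()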

Data model

The next thing to decide is what to store and how to organize the data. My suggestion is below (I do not know if you know SQL, but the table definitions are rather straightforward). This also gives an idea of what functionality I want to implement.

CREATE TABLE "fileobs" ( --Table with one row for each time a file
-- are observed and a row when a removed file are detected.
"device" 		VARCHAR NOT NULL ,--The name of the device 
--were the file are stored, e.g. “CD-R Lars photos Christmas 2003” or
-- “Harddrive partion C: on Bobs Laptop”
"fname" 		VARCHAR NOT NULL ,-- The filename 
--including path on the device. Not path to the device, the file 
--E:\images\carl\img1432.jpg are stored as  
--images/carl/img1432.jpg, if this are a CD-R the next time the 
--device are indexed it may be mounted as F: so it is not useful
-- to store the whole path. The pats are stored with / as path 
--separator.
"obs_time" 		DATETIME NOT NULL  DEFAULT CURRENT_TIMESTAMP ,
-- The time of the observation
"exist" 		BOOL NOT NULL ,--True if the file was 
--observed, if the file was observed in the previous indexing 
--operation but has been removed this field are False.
"size" 		INTEGER, 
"ctime" 		DATETIME, --creation time
"mtime" 		DATETIME, --modification time
"atime" 		DATETIME, --access time
"hash" 		VARCHAR, --A cryptographic hash sum of the file 
--content probably SHA256, if it was not calculated this time it
-- is NULL. Calculating the hash sum are a slow operation since 
--all the file content must be read.
"attrib" 		VARCHAR, --a string describing file 
--permissions and other file attributes, It would of course be 
--better to have separate columns for these but that is 
--problematic due to the fact that the available attributes 
--depends on the file system and operating system.
"tags" 		VARCHAR, --A comma separated list of tags, used to
-- let the user classify the files.
PRIMARY KEY ("device","fname","obs_time") );

CREATE TABLE "file_content" (--One row for each unique file that 
--has had a hash calculated, contains information about the 
--content that is unrelated to the filename and storage location.
"hash" 		VARCHAR PRIMARY KEY  NOT NULL , --A cryptographic
-- hash sum of the file content probably SHA256
"size" 		INTEGER, --Number of bytes
"aut_prio" 		FLOAT, --An automatic priority level 
--calculated according to to user configuration, based on 
--filetype, size and so on. Used when deciding the level of 
--redundancy in storage.
"man_prio" 		FLOAT, --An manually chosen priority level
-- that can be set by the user for selected files. Used when 
--deciding the level of redundancy in storage.
"compresion_ratio_est" FLOAT, --An estimate of how much the file 
--can be compressed. For small files callculated by compressing 
--the file, for large files calculated by compressing a selection 
--of data blocks from the file.
"tags" 		VARCHAR);--A comma separated list of tags, used to 
--let the user classify the files.
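
A small sketch of how compression_ratio_est could be calculated; zlib as the compressor and the sample size are assumptions, and for large files it only reads the start of the file rather than a selection of blocks as described above:

import zlib

def compression_ratio_est(path, sample_size=1024 * 1024):
    """Estimate how well a file compresses by compressing at most sample_size bytes of it."""
    with open(path, "rb") as f:
        data = f.read(sample_size)
    if not data:
        return 1.0                       # an empty file cannot be compressed further
    return len(zlib.compress(data)) / float(len(data))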


--A log of events or actions taken on this meta data file, such as
--indexing the files in some directory, making backup copies,
--merging two meta data files and so on. Used to document how the
--data and meta data are stored and handled. Also useful for
--debugging. Can also contain errors and warnings, like out of disk
--space or unexpected changes in the files (data corruption).
CREATE TABLE "events" (
"event_id" 		VARCHAR PRIMARY KEY  NOT NULL ,--A random
--string long enough to be almost certainly globally
--unique. About 20 bytes or more.
"time" 		DATETIME DEFAULT CURRENT_TIMESTAMP ,
"program" 		VARCHAR, --The name of the program that
--performed the action
"program_version" VARCHAR, --The program version, mainly for
--debugging.
"type" 		VARCHAR, --Event type, e.g. indexing, merge,
--backup, remove meta data.
"description" 	VARCHAR, --Text describing the event
"comment" 		VARCHAR, --A user-submitted comment.
"data" 		VARCHAR, --The parameters used for the command, in
--some format.
"operating_environment" VARCHAR, --Operating system, CPU; for debugging.
"tags" 		VARCHAR);--A comma-separated list of tags, used to
--let the user classify the events.


--The meta data is stored in a distributed manner, therefore
--the event log is not a simple list of events.
--The files on two computers can, for example, be indexed separately
--and then merged; then the merged meta data is copied and updated
--separately and then re-merged, and so on.
--This table records the relationships between the events as a
--directed acyclic graph (DAG).
CREATE TABLE "event_seq" (
"event_id" 		VARCHAR NOT NULL ,
"next_event_id" 	VARCHAR NOT NULL , 
PRIMARY KEY ("event_id", "next_event_id"));

--A table of the devices where the files are stored, e.g. “CD-R
--Lars photos Christmas 2003” or
--“Hard drive partition C: on Bobs Laptop”.
--A device can also be a zip-file or other file archive.
CREATE TABLE "devices" (
"name" 		VARCHAR PRIMARY KEY  NOT NULL , --The name of the
--device, normally selected by the user; for file archives it is
--something like “zip-file SHA256:zH\DHmct4c8FgdY6+c2jeVBnEJYgPgw4jpV5SXafnPz”
"type" 		VARCHAR,--Hard disk, CD-R, DVD, file archive
"description" 	VARCHAR,--The user may supply a longer description
--of the device.
"reliability" 	FLOAT,--An estimate of the reliability of the device.
"owner" 		VARCHAR, 
"default_prefix_path" VARCHAR, --The mount point for devices that
--do not change mount point too often, for example C: or /home
"tags" 		VARCHAR);--A comma-separated list of tags, used to
--let the user classify devices.

--Each device can be subject to many threats and each threat can
--be a threat to many devices.
--I think that several threats such as flooding, fire, theft
--and so on are often better combined into one threat based on
--location, since the difference is unimportant in this context.
CREATE TABLE "threat" (
"name" 		VARCHAR PRIMARY KEY  NOT NULL , --e.g. “building Park Avenue 12”,
--“Security hole in Win XP”
"type" 		VARCHAR, --e.g. shared building, online threat,
--user mistake
"description" 		VARCHAR);--The user may supply a longer
--description of the threat.

--A coupling table to connect devices and threats
CREATE TABLE "device_threat" (
"threat" 		VARCHAR NOT NULL ,
"device" 		VARCHAR NOT NULL ,
PRIMARY KEY ("threat", "device"));

--A table with descriptions of the tags used in other tables.
--Examples of tags could be “work related”,
--“Family related, Lars L”, “private”, “program files”.
--One use of this is to copy part of the meta data to another
--file, e.g.:
--Your parents do not want to keep backup copies of every family
--photo and video that you have a copy of, but you do not want to
--send them a list of all your private and work-related files, so
--you send them a meta data file containing only the meta data
--tagged “Family related, Lars L”.
CREATE TABLE "tags" (
"name" 		VARCHAR PRIMARY KEY  NOT NULL , 
"description" 		VARCHAR);

This is a rather complex data model, but I currently do not see how to simplify it without losing too much functionality; do you have any ideas? Much of the complexity comes from the ability to merge meta data that has been created and/or updated separately, but I think that is a useful feature; otherwise the meta data file would need to be moved from one computer to the next for each update. I think the prototype should begin with filling the core tables, file_content and fileobs. How do you like this compared to the JSON solution? Any other suggestions?
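
To make the merge feature more concrete, a sketch of merging one meta data file into another with SQLite; it relies on the primary keys to skip rows that already exist, assumes both files use the same schema version, and only handles the two core tables:

import sqlite3

def merge_into(target_db, source_db):
    """Copy fileobs and file_content rows from source_db into target_db, skipping duplicates."""
    con = sqlite3.connect(target_db)
    con.execute("ATTACH DATABASE ? AS src", (source_db,))
    con.execute("INSERT OR IGNORE INTO fileobs      SELECT * FROM src.fileobs")
    con.execute("INSERT OR IGNORE INTO file_content SELECT * FROM src.file_content")
    con.commit()
    con.execute("DETACH DATABASE src")
    con.close()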

Three suggestions currently:
  • for the table fileobs, it would be useful to add an additional field - the filetype. Some users may want to back up e.g. PNGs and ODBs separately.
  • for the table devices, it may be useful to add the field FileSystem. It would permit sorting the devices based on what filesystem they're formatted with.
  • I'd combine the tables fileobs and file_content, as they have three fields in common (hash, size and tags), and the only unique fields possessed by file_content are aut_prio, man_prio and compression_ratio_est.

Rocketshiporion 03:55, 2 January 2011 (UTC)

Response in order:
  • A good suggestion; an interesting question is how to find the file type in an OS-independent way. Not all operating systems use the end of the filename (and that is already stored in fname). I need to think more about this.
  • I do not know how to get the file-system type in an OS-independent way, but otherwise a good idea.
  • I think it is a good idea and I will probably begin without file_content and see if I run into any problems due to this non-normalized database.
--Gr8xoz (talk) 15:28, 4 January 2011 (UTC)

Backup of file content

There are many interesting features that can be implemented for storing the content of the files, such as backup over the net with minimal trust between the computers, encrypted backup, peer-to-peer backup, differential backup of files that have similar content and so on. I do not see this as the core functionality, so my ambition is to begin with a function that copies the files that need a new backup to a specified folder.
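
That first copy-to-a-folder function could be as simple as the sketch below; the argument is assumed to be a list of (absolute source path, path relative to the device) pairs produced by the selection step:

import os
import shutil

def backup_to_folder(files, dest_root):
    """Copy each file into dest_root, recreating its relative path there."""
    for src, rel_path in files:
        dst = os.path.join(dest_root, rel_path)
        if not os.path.isdir(os.path.dirname(dst)):
            os.makedirs(os.path.dirname(dst))
        shutil.copy2(src, dst)    # copy2 also preserves the modification time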

UI

The interaction with the user can be complex, therefore it is important to have a good user interface. I see four types of possible UIs, two textual and two graphical. I think a textual user interface is needed. It is important that routine tasks can be scripted so users with limited computer skills can run them after someone has configured them. The textual user interface can either be a command-line interface where everything is specified on the command line, or it could be a scripting interface where the actions and parameters are specified in some sort of text file. Templates for common tasks with helpful comments of course need to be supplied. One simple way of doing this is to write a Python library and then let the user write a very simple Python script to specify what to do. One simple example could look like this:

from FBID import * # Needed to import the library, DO NOT CHANGE
open_meta_data(r"C:\Documents and Settings\Lars\Mina dokument\meta.b")

device_path("Lars Documents", r"C:\Documents and Settings\Lars\Mina dokument")
device_path("Lars backup Hard drive 1", "F:")

index_files("Lars Documents",
            calculate_hash="when date has changed") # other possible
            # values are "always" and "never"

select_files_not_found_on_devices(["Lars backup Hard drive 1", "Lars DVD-backup 2010-12-10"])
backup_files(from_device="Lars Documents",   # "from" alone is a reserved word in Python
             to_path=r"F:\backup\content")
# Calculates the SHA256 for every file on
# "Lars backup Hard drive 1" and warns if any file has changed.
verify_device("Lars backup Hard drive 1")
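
For concreteness, roughly how an index_files call could be implemented inside such a library; the FBID function names above are the intended interface, while everything in this sketch (connection handling, hashing every file in full, storing mtime as a Unix timestamp) is an assumption:

import datetime
import hashlib
import os
import sqlite3

def index_directory(con, device, root):
    """Walk root and insert one fileobs row per file found on the named device."""
    obs_time = datetime.datetime.utcnow().isoformat()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")   # / as path separator
            st = os.stat(full)
            with open(full, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()        # reads the whole file; slow
            con.execute(
                "INSERT INTO fileobs (device, fname, obs_time, exist, size, mtime, hash) "
                "VALUES (?, ?, ?, 1, ?, ?, ?)",
                (device, rel, obs_time, st.st_size, st.st_mtime, digest))
    con.commit()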

GUI

In addition to the text-based user interface I think it would often be useful to have a graphical user interface, especially for more interactive work. This can be done as a web interface or a normal GUI. Since this program is intended to run locally, I think a normal GUI is most appropriate in order to avoid security issues and configuration problems with firewalls and so on. I think the overall design will look similar to this: http://www.digitalvolcano.co.uk/content/duplicate-cleaner/screenshots (a tabbed, wizard-like interface)

The program should maybe include analysis functions like this: http://windirstat.info/index.php

I think I will use wxPython as the GUI library. http://www.wxpython.org/screenshots.php The GUI can be implemented as a wrapper around the textual user interface, or it can be implemented as an integral part of the program. I am not sure which is the best way to go. I think a wrapper around the textual user interface is a nicer design, but I am not sure which is easiest to implement.

I think the wrapper would be easier to implement; you only need to make one interface (the scriptable textual interface), then wrap a GUI around it. Plus, it would be easier to change the GUI in future - making a new wrapper should be easier than creating a whole new GUI. Rocketshiporion 23:40, 21 December 2010 (UTC)
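
A very small sketch of what the wrapper idea could look like with wxPython: the GUI only collects parameters and then calls the same library the textual interface uses (index_files is taken from the script example above; the rest of the names are made up):

import wx
# from FBID import index_files   # the same library the textual interface uses

class IndexFrame(wx.Frame):
    def __init__(self):
        wx.Frame.__init__(self, None, title="Index files")
        panel = wx.Panel(self)
        self.device = wx.TextCtrl(panel, value="Lars Documents", pos=(10, 10), size=(250, -1))
        button = wx.Button(panel, label="Index", pos=(10, 45))
        button.Bind(wx.EVT_BUTTON, self.on_index)

    def on_index(self, event):
        # index_files(self.device.GetValue())   # delegate to the textual interface
        wx.MessageBox("Would index device: " + self.device.GetValue())

app = wx.App(False)
IndexFrame().Show()
app.MainLoop()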

File selection

An important problem in both the textual and the graphical user interface is how to select files for different purposes, for example backup, deletion, or just listing them for the user. This is an interesting balance between ease of use, performance and expressive power. That is something I would like to discuss with you later.


Terminology

I use a number of words, e.g. indexing, meta data, device, file observation, file content, threat and so on; if this shall be used by more than me, it is important that these have a clear and precise meaning. If something is unclear or you have suggestions for better terminology, I would appreciate it if you told me.

What do you think of the overall concept, storage format, user interface and so on?

Unrelated comments in our discussion on space colonization and nuclear proliferation

Space colonization

I do not believe in colonization of Mars; free-space colonization makes much more sense. I think we will begin with colonies in orbit around the Earth or Moon and then move on to colonies in orbit around the Sun. I think raw material will in the beginning mainly be mined on the Moon, the moons of Mars and in the asteroid belt. Mars is a small, inhospitable planet that offers very few advantages over free-space colonization. Its gravity is strong enough that transportation to and from Mars is expensive. If you are interested I could mail you some text I have written on this.

I still don't understand why the small diameter of the nuclear explosive is important; I would think it is the total volume and mass that matter. I do not understand why the propulsion unit needs to be 500 mm in order to contain a 120 mm nuclear explosive; are you measuring the nuclear explosive without the conventional explosive lenses? My understanding of the Orion propulsion system is that it is very inefficient for a small spacecraft.

When I say 120mm for the nuclear explosive, I mean just the plutonium pit. As for the Orion-type shuttle, it would in no way be anywhere as small as the Space Shuttle - I intended a craft with a much larger overall diameter, but still significantly smaller than a full-size Orion. It would be enormous compared to the Space Shuttle; and its only similarity is that it would shuttle materiel and personnel between Earth and its outposts. I would most certainly like to read the text you have written on free space colonization. Rocketshiporion 04:32, 3 January 2011 (UTC)
I think a 120 mm plutonium pit is larger than most pits; the pit in Fat Man was only 90 mm. I do not think a diameter of less than 500 mm would be hard to achieve: the W54 has a diameter of 270 mm and is 400 mm long. It has a mass of 23 kg and a yield of 250 tons TNT at its most powerful setting; an experimental version, the XW54, was tested with a yield of 6000 tons TNT. [1]
The Orion base design "Interplanetary" (total mass of the ship 4 000 000 kg) planned to use 800 nuclear explosions, each equivalent to 140 tons of TNT, to reach LEO. The Orion design "Advanced interplanetary" (total mass of the ship 10 000 000 kg) planned to use 800 nuclear explosions, each equivalent to 350 tons of TNT, to reach LEO.
The interstellar designs use 1 Mt TNT devices. The W59 is a 1 Mt device with a diameter smaller than 500 mm (414 mm), but it is 1215 mm long and has a mass of 250 kg. The propulsion unit will maybe be somewhat larger due to the need for a directional explosion and reaction mass. I will e-mail you the text about space colonisation within some days.--Gr8xoz (talk) 16:12, 4 January 2011 (UTC)

Nuclear proliferation

Of course, rogue states with active programs for the development of nuclear weapons are currently hard to stop. But I think that in the long run the general use of nuclear power, and of nuclear explosives especially, will affect nuclear proliferation. If Orion-style propulsion becomes big business then it will be hard to argue that some countries should not use it, and it is very easy to weaponize the technology. It is also important to remember that rogue states are not constant; Iran was a much nicer country before 1978. What I am most afraid of is not a rogue state using a few nuclear bombs but an escalating nuclear war, where some rogue state uses a nuclear bomb, other countries retaliate and start a chain reaction similar to the events that led to the First World War. This could threaten the survival of human civilisation. Some estimates of the chance of human civilisation surviving 100 years are as low as 50%, and nuclear war plays an important role in this estimate. I think that is far too pessimistic, but I think the risks are big enough to be taken very seriously.

I'm interested in Orion because I see it as the fastest way to get off our planet. But you're right about the possibility of a nuclear holocaust due to a political chain reaction - in my hurry to get to other parts of the Solar System, I had overlooked the possibility that the human species might use Orion to annihilate itself first, leaving no one left to actually use Orion to go anywhere! Now that I come to think of it, I can all too easily imagine a nuclear war between the US and China, North Korea, Iran, etc. wiping out a billion people. Even a war between Israel and Iran could kill a hundred million people. And then there's nuclear-armed India and Pakistan...
While farther off in the future, something like Antimatter-Catalyzed Nuclear Pulse Propulsion might be safer - it can't easily be used to destroy cities. Rocketshiporion 04:50, 3 January 2011 (UTC)