Persisting Just the Filename

I had an idea this morning: I can begin development of AudioMan collection persistence by saving only filenames to the database. When the application starts, the filenames are loaded from the database one at a time and each file's metadata is reread from the file into an object (AllMetadata). That object is then used in the model (see MVC) for collection browsing and editing in the UI. When the user adds a file to the collection, that file would also be added to the database -- and removed from the database when it is removed from the collection.

At this point the database does not keep track of metadata property values, so they have to be reread at application startup/load/initialization time. Only saving filenames allows me to get Hibernate and hsqldb running with a minimum amount of data, which will be easier to integrate into the existing application.
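
To make the filename-only idea concrete, here is a minimal sketch. It is written in Python with the standard sqlite3 module purely as a stand-in for Hibernate and hsqldb; the table name, the function names and the fake_read_metadata placeholder are all assumptions for illustration, not AudioMan code.

```python
import sqlite3


def open_collection(db_path=":memory:"):
    """Open the database and make sure the single filename table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS files (filename TEXT PRIMARY KEY)")
    conn.commit()
    return conn


def add_file(conn, filename):
    """Adding a file to the collection also adds its filename to the database."""
    conn.execute("INSERT OR IGNORE INTO files (filename) VALUES (?)", (filename,))
    conn.commit()


def remove_file(conn, filename):
    """Removing a file from the collection also removes it from the database."""
    conn.execute("DELETE FROM files WHERE filename = ?", (filename,))
    conn.commit()


def load_collection(conn, read_metadata):
    """At startup, reread metadata for every stored filename into the model."""
    model = {}
    for (filename,) in conn.execute("SELECT filename FROM files ORDER BY filename"):
        model[filename] = read_metadata(filename)
    return model


def fake_read_metadata(filename):
    """Hypothetical placeholder for the real metadata reader."""
    return {"title": filename.rsplit("/", 1)[-1]}
```

Because nothing but the filename is stored, every startup pays the full cost of rereading each file, which is the trade-off described above.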

The next step would be saving all of the metadata property values in the database. There are a few reasons for persisting that much data:

  • At startup time only those files that have "stale" metadata would need to be reread from their files, so startup would be faster.
  • With a database I can do neat metadata property searching features like find-as-you-type.
  • I need to save metadata property values for backup media because after the media is scanned into AudioMan I probably won't be able to reread it.
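
The first two points can be sketched roughly as follows. This assumes that "stale" means the file's modification time no longer matches the one recorded when its metadata was saved, which is only one possible definition. The schema, the function names and the use of Python's sqlite3 (instead of Hibernate and hsqldb) are likewise assumptions.

```python
import sqlite3


def open_db():
    """Create an in-memory table holding one title and one mtime per file."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata (filename TEXT PRIMARY KEY, mtime REAL, title TEXT)"
    )
    conn.commit()
    return conn


def save(conn, filename, mtime, title):
    """Store (or overwrite) the metadata snapshot for one file."""
    conn.execute(
        "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)", (filename, mtime, title)
    )
    conn.commit()


def stale_files(conn, current_mtimes):
    """Return filenames whose current mtime differs from the stored one.

    A file missing from current_mtimes is also reported as stale.
    """
    query = "SELECT filename, mtime FROM metadata ORDER BY filename"
    return [
        filename
        for filename, stored_mtime in conn.execute(query)
        if current_mtimes.get(filename) != stored_mtime
    ]


def find_as_you_type(conn, prefix):
    """Return titles starting with the typed prefix (ASCII case-insensitive)."""
    # Escape LIKE wildcards so typed '%' or '_' characters match literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT title FROM metadata WHERE title LIKE ? ESCAPE '\\' ORDER BY title",
        (escaped + "%",),
    )
    return [title for (title,) in rows]
```

At startup only the filenames returned by stale_files would need to be reread; in a real application the current modification times would come from something like os.path.getmtime.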

Database persistence would be eager in that it would save new metadata to the database as soon as it was read from the file by the Durham Metadata Framework for Eclipse. I won't have to do any saving to the database when the application is closed -- I'll just have to close the connection to the database. If the application crashes the user won't lose changes they made to their collection in that session.
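
A rough sketch of what eager persistence could look like, again using Python's sqlite3 as a stand-in for the real stack. The EagerStore class and its on_metadata_read callback are hypothetical names; the point is only that every read is committed immediately, so closing the application requires nothing more than closing the connection.

```python
import sqlite3


class EagerStore:
    """Writes metadata to the database as soon as it has been read."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS metadata ("
            "filename TEXT, name TEXT, value TEXT, "
            "PRIMARY KEY (filename, name))"
        )
        self.conn.commit()

    def on_metadata_read(self, filename, properties):
        """Hypothetical callback invoked right after a file's metadata is read."""
        # 'with conn' commits on success and rolls back on an exception,
        # so each file's properties are saved as one committed transaction.
        with self.conn:
            self.conn.executemany(
                "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
                [(filename, name, value) for name, value in properties.items()],
            )

    def close(self):
        """Nothing is left to save at shutdown; just close the connection."""
        self.conn.close()
```

Since each change is committed when it happens, a second connection opened after a crash would still see everything written up to that point.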

Side note: why do I always get good ideas when I'm not sitting in front of a computer? Things pop into my head in the car, in the shower, on walks, exercising ... maybe I need to be AFK more often.

I guess that just goes to show me though -- sometimes people can do their best thinking far far away from work, when they are relaxed and actually have time to THINK! My memory is terrible so I've gotten into the habit of carrying note paper and a pen around with me ... I forget ideas almost as quickly as I think them up. Ha.

Posted at April 16, 2005 at 06:56 AM EST
Last updated April 16, 2005 at 06:56 AM EST
Comments

You might want to persist something else that identifies the file, other than the file name. People like to move files around, and it would be nice if you could rename a file without it losing its identity. Maybe store hashes of files and use those to identify them, maybe along with the name. Then it would be easy to tell if they changed. You might be able to keep track of when files have just moved too. It might take a considerable amount of time to compute the hashes of thousands of files. Probably too long. Maybe just hash the first time, hash new files, and hash on command.

» Posted by: Kibbee at April 17, 2005 12:46 PM

A file hash (and I'm assuming you mean a byte hash like MD5) is not unique and would be a poor file identifier. I can have a copy of a file on my hard disk and an exact copy of the same file on a backup disc. I could even have the same file on two different partitions. Same hash, different locations.

The way I did it: each file in the database has its own unique identifier, not related to the filepath. It is just an auto-increment field. If a file is moved the UI will indicate that, probably with a big red X.
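
As an illustration only (the actual AudioMan schema is not shown in this post), an auto-increment identifier that is independent of the path might look like the sketch below. It uses Python's sqlite3 as a stand-in, and the table and function names are assumptions.

```python
import os
import sqlite3


def open_db():
    """Create a table whose id is generated by the database, not the path."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "path TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def add_file(conn, path):
    """Insert a file and return the identifier the database assigned to it."""
    cursor = conn.execute("INSERT INTO files (path) VALUES (?)", (path,))
    conn.commit()
    return cursor.lastrowid


def missing_files(conn, exists=os.path.exists):
    """Return (id, path) pairs whose path no longer exists on disk.

    These are the entries a UI could flag, for example with a red X.
    """
    rows = conn.execute("SELECT id, path FROM files ORDER BY id")
    return [(file_id, path) for file_id, path in rows if not exists(path)]
```

The exists parameter defaults to os.path.exists and is only there so the check can be exercised without touching the real filesystem.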

Even if I did use hashes there would be no way to tell a move from a copy. A move between disks is just a copy and delete. A move on the same partition is just a node change.

Besides all of that, hashes have collisions, which again makes them poor identifiers. I don't think hashes were ever meant to be used in that way.

» Posted by: Ryan at April 17, 2005 12:55 PM

Not sure if I'm following this correctly but:

Doesn't it all depend on *what* data you're hashing? Off the top of my head I can think of two things you can append to the file data that would make your hash unique no matter what file or partition you put it on:

- creation date (deals with copies)
- full path (deals with moves)

True, collisions can occur; however, again it depends on how big your set is, etc. If you're really afraid of collisions, it's a question of doing the math and seeing if you can live with it.
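
"Doing the math" here usually means the birthday bound. The sketch below assumes the hash output is uniformly distributed and considers only accidental collisions, not deliberately constructed ones; the function name is made up for illustration.

```python
import math


def collision_probability(num_files, hash_bits):
    """Approximate chance that at least two of num_files share a hash value.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / (2 * 2^bits)).
    """
    exponent = num_files * (num_files - 1) / (2 * 2 ** hash_bits)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-exponent)


# 100,000 files with a 128-bit hash such as MD5 versus a 32-bit checksum.
p_128 = collision_probability(100_000, 128)
p_32 = collision_probability(100_000, 32)
```

Under those assumptions a collection of 100,000 files gives a probability on the order of 1e-29 for a 128-bit hash, while a 32-bit checksum over the same collection is more likely than not to collide.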

» Posted by: Contributor at April 17, 2005 10:05 PM

It's an interesting discussion but it's moot. AudioMan will not try to track file moves. iTunes doesn't either.

» Posted by: Ryan at April 17, 2005 10:25 PM

The nice thing about a hash is that you know that 2 files are the same, or may be the same. If you have one copy in one place, and one copy in another place, it might be nice to know that they are the same file. You'd probably want to know if you had 3 copies of the same data on your hard drive, so you could get rid of the excess ones. You may even want to store the location of the file separately from the file, so that it can have multiple locations. This way you wouldn't need to have extra copies of everything for extra copies of the data. It might also be nice to see if one copy was updated, then update the other copies accordingly. I also think "Contributor" was right. Using proper hashes, there's very low expectation of a collision. These can be handled when they arrive. You may not want to use the hash as an identifier, but it may be nice to have when comparing files.
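
A small sketch of the duplicate-finding use suggested here. It takes file contents as bytes so it stays self-contained, and it uses SHA-256 from Python's hashlib; both choices, along with the function name, are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def group_duplicates(files):
    """Group paths by content digest and return the groups with 2+ paths.

    'files' maps a path to that file's bytes.
    """
    paths_by_digest = defaultdict(list)
    for path, data in files.items():
        paths_by_digest[hashlib.sha256(data).hexdigest()].append(path)
    return [sorted(paths) for paths in paths_by_digest.values() if len(paths) > 1]
```

The digest works here as a comparison key rather than an identifier: one digest maps to several locations, which matches the idea of storing locations separately from the file.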

» Posted by: Kibbee at April 18, 2005 01:36 PM

Here's an example where hashes break down: MP3 files have metadata in ID3 tags which are part of the bytes of the file.

If I make a copy of an MP3 file and modify the copy's ID3 tag slightly, the hash no longer matches the original. They have the same audio data, different metadata and completely different hashes.

That's why I'd like to use something like MusicBrainz's TRM. It analyzes the file's audio and generates a number from it. For the same audio, this number is the same. You still can't tell a move from a copy but at least you can say it's the same audio.

So if I take my example of copying a file: Even after the metadata change on the copy, the files would still have the same TRM value because it only analyses the audio. A hash uses the audio+metadata (+creationDate+whatever) and wouldn't be the same after a metadata change.
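
The difference can be shown with a toy example. To be clear, this is not TRM, which analyses the decoded audio: the sketch only strips a trailing ID3v1 tag (the last 128 bytes of a file, beginning with "TAG") before hashing. It ignores ID3v2 tags at the start of the file, and all function names are made up.

```python
import hashlib

ID3V1_SIZE = 128  # an ID3v1 tag occupies the final 128 bytes of the file


def strip_id3v1(data):
    """Return the bytes without a trailing ID3v1 tag, if one is present."""
    if len(data) >= ID3V1_SIZE and data[-ID3V1_SIZE:-ID3V1_SIZE + 3] == b"TAG":
        return data[:-ID3V1_SIZE]
    return data


def whole_file_hash(data):
    """Hash of audio plus metadata: changes whenever the tag changes."""
    return hashlib.sha256(data).hexdigest()


def audio_only_hash(data):
    """Hash of the bytes before the tag: unaffected by tag edits."""
    return hashlib.sha256(strip_id3v1(data)).hexdigest()


def make_tag(title):
    """Build a simplified 128-byte ID3v1-style tag holding only a title."""
    return (b"TAG" + title.encode("ascii")[:30]).ljust(ID3V1_SIZE, b"\x00")


audio = b"\xff\xfb" + bytes(1000)            # stand-in for MP3 frame data
original = audio + make_tag("Song")
edited_copy = audio + make_tag("Song (edited)")
```

With these inputs the whole-file hashes of original and edited_copy differ while the audio-only hashes match. Unlike an acoustic fingerprint, though, a byte hash of the audio would still change if the file were re-encoded.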

I've put a lot of thought into these issues over the course of AudioMan's two years of development. The iTunes and Google Picasa guys probably have too. Definitely not an easy problem to solve.

» Posted by: Ryan at April 18, 2005 03:42 PM