Comments on: Archiving

By: Crooked Timber » » Is it … atomic? Very atomic, sir.

Crooked Timber » » Is it … atomic? Very atomic, sir. — Thu, 08 Feb 2007 13:46:49 +0000

[…] archiving post got good results. If you want to cite a webpage (in an academic paper, say) and you want to do […]

By: stuart

stuart — Wed, 07 Feb 2007 12:32:56 +0000

But archive.org retroactively obeys the robots.txt directives it finds on current websites, and thus makes content inaccessible if the owner of a domain changes his/her mind later about whether the content should be visible.

Well, especially as expired domains are now often picked up by porn redirect sites to get the extra hits from old links and favourites, so you can't even rely on dead domains staying dead (and hence having no new robots.txt to change the instructions) the way you could in the past.

By: notme

notme — Wed, 07 Feb 2007 03:27:00 +0000

Here's a recent bibliography from a presentation titled "New Tools for Preserving Digital Collections", given by Tracy Seneca of the California Digital Library at the Library & Information Technology Association's 2006 Forum. Go here for the Forum's main page. Then there's the January 2007 issue of Ariadne, with relevant articles and conference reports. That information is oriented to organizations (such as libraries) seeking to archive web pages, but you are going to have many of the problems they've already spent a lot of time analyzing. If you're only interested in "pretty good" persistence for someone else's document, and you want to be able to point people to it with a URL, then WebCite seems to be the way to go, assuming the pages aren't already being archived by a national library.

By: bemused

bemused — Wed, 07 Feb 2007 02:36:16 +0000

To everyone suggesting references to the wayback machine — as long as you are referring to content on a domain you control (and intend to control in future) you’re golden. But archive.org retroactively obeys the robots.txt directives it finds on current websites, and thus makes content inaccessible if the owner of a domain changes his/her mind later about whether the content should be visible.

Beware.
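A rough way to see the risk described above is to test a site's current robots.txt against the archive crawler's user agent. The sketch below uses Python's standard `urllib.robotparser`; the user-agent token `ia_archiver`, the example.org URL, and both sets of rules are assumptions made up for illustration, not anything taken from archive.org's documentation.

```python
from urllib.robotparser import RobotFileParser


def archive_blocked(robots_txt: str, url: str, agent: str = "ia_archiver") -> bool:
    """Return True if the given robots.txt text disallows `agent` from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# Hypothetical rules that a later domain owner might publish.
NEW_OWNER_RULES = """\
User-agent: ia_archiver
Disallow: /
"""

# Hypothetical rules that leave everything open to every crawler.
OPEN_RULES = """\
User-agent: *
Disallow:
"""

if __name__ == "__main__":
    page = "http://example.org/paper.html"  # placeholder URL
    print("blocked under new owner:", archive_blocked(NEW_OWNER_RULES, page))
    print("blocked under open rules:", archive_blocked(OPEN_RULES, page))
```

This only checks what the rules say today. Whether an archive applies them retroactively to copies it already holds is a policy question the code cannot answer.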

By: joeo

joeo — Wed, 07 Feb 2007 01:52:06 +0000

WebCite looks pretty sweet.

Here is an example I just made:

http://www.webcitation.org/5MSlJEAAW

By: Bob Calder

Bob Calder — Wed, 07 Feb 2007 00:49:21 +0000

I noticed the Internet Archive mentioned only once. There is a REASON they call it the Internet Archive. Really.

Of course there are issues. There are always issues.

By: Fr.

Fr. — Tue, 06 Feb 2007 22:32:38 +0000

1. Archived copy (with Refresh) - With Furl, you can save a copy of a web page and it is archived for you. This means that you can access that page and read it any time you need to, even if the web site is down, or the page has changed on the original web site, or even if the page is no longer accessible for free. In those cases where the web page has changed and you would rather have the new content, you can refresh the archived copy at will. [furl]
2. Full-page capture programmes, like Zotero. Some other browser plugins can also do it.

By: Eszter

Eszter — Tue, 06 Feb 2007 21:36:46 +0000

LW – Thanks for clarifying what you meant. However, at that rate any page can change on the Web, of course. (I guess it would be hard to get a Web Archive page to change, but even archives of media sites change on their own domains as the sites go through entire revamps and post different ads, just to name an example.)

Sure, in that sense it’s all dynamic. But since you mentioned the Flickr example with the Amazon example, I thought you were focusing more on the pages that are so dynamic that they almost certainly would not come up the same way as a result to a search by someone else using another machine logged in to another profile (or no login at all). I was surprised you’d equate Flickr to that sort of dynamic rendering, that’s all.

As for the Shakespeare garden, I’m living in Palo Alto this year so I can’t say. If it wasn’t so cold I’d say I’ll check it out next week when I’m back briefly, but I don’t think that’s going to happen this time around.

By: abb1

abb1 — Tue, 06 Feb 2007 19:58:41 +0000

Sfisher, sure, clearly if what you have is a bunch of documents – you need to archive the documents.

And it’s certainly true that archiving databases often isn’t simple.

I’m just saying that conceptually you’re probably better off with a copy of the database than with a bunch of reports based on this database.

By: Adam Kotsko

Adam Kotsko — Tue, 06 Feb 2007 19:33:45 +0000

You should probably put the safe in a fallout shelter as well.

By: e-tat

e-tat — Tue, 06 Feb 2007 19:25:03 +0000

Yeah: Furl, Scrapbook, and Spurl are all good for making archive copies. With Furl and Spurl you also get to see who else has made a copy... so you can, like, ask to borrow theirs.

By: sfisher

sfisher — Tue, 06 Feb 2007 18:59:28 +0000

In reply to abb1 (#39): I think it depends on the situation as to whether you want to preserve a presentation of the data, or some kind of database/XML dump of all the data itself. An exact presentation may be the best thing for something like an academic paper reference in which you want to say “see page 39 of pdf xxx” and point to an exact spot that’s fairly stable for humans to look at and will be easily available for a while.

On the other hand, preserving the data for machines to look at and do gymnastics on is flexible since you can theoretically present it or crunch the data again in various ways if you have all the original information somewhere. But creating new views of data isn’t always cheap or trivial. It becomes even more expensive and more complicated if the original rendering software isn’t preserved along with the data (or if it becomes obsolete). At some point the cost of re-rendering from some specific data structure likely becomes higher than the known or suspected value of the data. By contrast, a fixed render in a standard format is likely to have pre-packaged software available to render it in a fixed and inflexible way, but at much lower cost further into the future.

Likely you could archive both the original data and a fixed render, which gives you some flexibility as well as lower cost for previewing/seeing the contents.
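Keeping the raw data and a fixed render side by side might look something like the sketch below. It assumes the data can be dumped as JSON and the render is plain text. The file names `data.json`, `render.txt`, and `manifest.json` and the function `build_archive` are hypothetical names chosen for this illustration.

```python
import hashlib
import io
import json
import zipfile


def build_archive(records: list[dict], rendered: str) -> bytes:
    """Pack raw data and a fixed render into one zip, with SHA-256 checksums."""
    data_bytes = json.dumps(records, indent=2, sort_keys=True).encode("utf-8")
    render_bytes = rendered.encode("utf-8")
    # The manifest lets a later reader verify that neither file was altered.
    manifest = {
        "data.json": hashlib.sha256(data_bytes).hexdigest(),
        "render.txt": hashlib.sha256(render_bytes).hexdigest(),
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("data.json", data_bytes)
        archive.writestr("render.txt", render_bytes)
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up sample data and a made-up rendering of it.
    rows = [{"id": 1, "title": "Example entry"}]
    report = "1. Example entry\n"
    blob = build_archive(rows, report)
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        print(sorted(archive.namelist()))
```

The render is what a human opens cheaply years later, and the data dump is there if anyone wants to re-crunch it.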

In reply to Peter (#38): Yikes. That’s a horrible story. That makes me consider that I should have better back-up practices for off site storage. (Or perhaps do twice-yearly DVD back ups of my USB hard drive back-ups to store off site). Not that it would ruin my life if I lost all my data on my home computer, but it would probably be more unpleasant than doing occasional off-site archiving.

By: lw

lw — Tue, 06 Feb 2007 18:50:35 +0000

Flickr, like many services, generates base URLs in a way closely connected to the way underlying objects are stored in the db; in Flickr’s case (as with most services), there’s a core object the page is associated with. I was actually thinking of added comments or edits, both of which change the content of the page displayed; capturing the human reasoning “the photo is key; same photo, same URL, everything OK” in a program is not easy. That intuitively stable mapping may not exist; consider the document http://del.icio.us/recent, or a vendor’s “you may also like” recommendations. The core attribute linking the entries on that page is the result of a query against an ever-changing db. Or consider this very exchange: are comments not written by the entry’s author a part of this page or not? What if the interface were richer and links to threads with similar keywords were displayed: would those links be part of the page? What about stable URLs associated with an astronomical object that just blew up, or a gene for which there’s a newly reported functional assay? Same query, same URL, same “core” entity, but a qualitatively different page. These are not marginal cases: pages with a single definite human author and denumerable edits by that author are already rare.

Static documents are at least superficially indistinguishable from dynamic objects that are pretty different from them, making scope and synchrony questions essential for any internet archiving beyond saving a small amount of html locally. This leaves aside the unformalizable question of format/content, which is not one of HTML’s strengths.
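One crude way to make the "same URL, different page" problem visible is to record a content hash and retrieval time with every capture. The sketch below is only an illustration: the `Capture` record, the helper names, and the sample page bodies are all hypothetical. A byte-level hash will also flag trivial changes such as a rotated ad, so it detects difference without judging whether the difference matters.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Capture:
    """What was fetched, when, and a fingerprint of the bytes received."""
    url: str
    retrieved_at: datetime
    sha256: str


def capture(url: str, body: bytes, when: datetime) -> Capture:
    """Build a Capture from bytes already fetched (no network access here)."""
    return Capture(url, when, hashlib.sha256(body).hexdigest())


def same_content(a: Capture, b: Capture) -> bool:
    """True only if both captures share a URL and identical bytes."""
    return a.url == b.url and a.sha256 == b.sha256


if __name__ == "__main__":
    url = "http://example.org/photo/123"  # placeholder URL
    first = capture(url, b"<p>photo</p>",
                    datetime(2007, 2, 6, tzinfo=timezone.utc))
    later = capture(url, b"<p>photo</p><p>new comment</p>",
                    datetime(2007, 2, 7, tzinfo=timezone.utc))
    print("unchanged:", same_content(first, later))
```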

On a local note– are the Shakespeare gardens behind Vogelback being kept up?

By: atd

atd — Tue, 06 Feb 2007 18:43:07 +0000

Zotero does full snapshots and plays super nice with other bibliographic programs.

http://www.zotero.org/

By: abb1

abb1 — Tue, 06 Feb 2007 17:20:53 +0000

Nah, what you really care about is the data, not all the zillion snapshots of various presentations of this data.