Archiving

by John Holbo on February 6, 2007

I have an archiving question. Suppose I wanted to save a link to a page and do my best to ensure that it stays good – for years and years, in principle, even if the page goes away. As long as google shall reign. The obvious answer: just link to a googlecache URL. One thing I’m not clear about is google’s policy about these page images. Suppose I record a page as it appeared, say, today. That is: the google page has a little ‘as retrieved on 06 Feb 2007 04:14:38 GMT’ or whatever. Does google only take a new snapshot if something changes on the page (does it have some way of knowing that?) so that, so long as the page isn’t changed, the snapshot stays good forever? Suppose I link to a page and, in two years, the page is modified and google has in the meantime taken any number of fresh cache images. There’s no expiry date on old caches, right? (I’m sure about this. But I want to be really sure. Has google made any explicit commitments, archiving-wise? I’m thinking about the long haul here.) Also, googlecache URL’s are a bit unwieldy. Tinyurl promises to offer more convenient handles that ‘never expire’. But what if tinyurl dies? Would it be bad archiving policy to tiny-fy a googlecache link, for convenience (so someone could type it in, more easily than those long monsters?) What would you say are best practices, short of doing the old fashioned thing and hitting ‘print’ and locking the results in a fireproof safe?

{ 1 trackback }

Crooked Timber » » Is it … atomic? Very atomic, sir.
02.08.07 at 1:46 pm

{ 52 comments }

1

Kieran Healy 02.06.07 at 2:08 am

There’s no expiry date on old caches, right?

I think there is — I believe if the page changes at all the cache will eventually be updated and the old one will be replaced. If you want to keep a record, then you need to save it somehow yourself (screenshot, web archive, pdf, whatever) and keep it somewhere you control, whether in your fireproof safe or on your website.

2

John Holbo 02.06.07 at 2:14 am

That’s rather important, thanks Kieran. If the page disappears, will the cache eventually disappear? (Does google have a policy?) I’m not so much worried about people rewriting history as I am about things just dying and eventually going away. But it’s true that if rewriting history is possible, then the googlecache is of limited utility as a history-recording device.

3

Joel Turnipseed 02.06.07 at 2:20 am

Use the Wayback Machine: http://www.archive.org/web/web.php

That way, you can also (more closely) support APA/MLA style for citations, right?

Of course, there’s no guarantee the Wayback folks will stay solvent either.

My suggestion? Use Wayback for most everything, and local HD for most everything else (you can save whole Web pages in IE/Firefox/Safari) & get one of them terabyte arrays Henry (was it?) was bragging about the other day. I noticed recently that LaCie has a $400 1TB external hard drive – well within the price range of all but the poorest CT readers.

4

terence 02.06.07 at 2:20 am

I always thought google caches did expire. But I don’t know where I got the idea from. Accordingly, I’d love to hear form someone who new as well.

One alternative to tiny url would be to set up a blog using blogger and simply hyperlink the word ‘here’ (or whatever word you want) to the URL. That way what you see on screen is tidy – the link is just a click away and you can actually retrieve the URL itself by editing the hyperlink. Heck, you could even do this with MS Word…

5

terence 02.06.07 at 2:22 am

‘someone who new’ gggrrrrrrr…not even the spell check on Firefox 2 can save me.

6

John Holbo 02.06.07 at 2:35 am

Thanks for the wayback tip, Joel. I used it just a few years ago and found that it was a bit hit and miss, for purposes of finding stuff, whereas google always sees everything. So I didn’t get in the habit. But checking now, its coverage seems good. At any rate, for things that it actually has found, surely this is the best way to go. I didn’t realize it created unique URLs. It is really a citation question, not an archiving question, perhaps. But then again it’s an archiving question because I myself can’t commit to hosting content, where anyone might hope to find it, because the content doesn’t belong to me and I don’t have the personality type to be an archive myself anyway. I’m rather a sloppy backer-up of even my own stuff.

Does anyone know about the tinyurl ‘never expires’ promise? Is that even possible? There couldn’t be any way of guaranteeing a redirect if they aren’t capable of guaranteeing they’ll always own the address, right?

7

Luis Villa 02.06.07 at 2:36 am

I think what you want, though you don’t know it yet, is WebCite.

8

Kieran Healy 02.06.07 at 2:38 am

An example: my website and blog were hosted on fiachra.soc.arizona.edu for a couple of years around 2002 and 2003. The wayback machine has some snapshots of it. But that server no longer exists, and neither does google have any record of it in its cache, though it would have been cached at the time.

9

KCinDC 02.06.07 at 2:39 am

I’m pretty sure Google doesn’t have an archive, just one recently cached copy of some pages, and that will go away fairly soon after the page does. The Wayback Machine is a better bet, though unfortunately if the domain changes hands it seems the Internet Archive allows the new owners to determine whether they continue to archive the pages — that is, if the new owners put in a robots.txt file that excludes the archive.org bot, the earlier pages will be removed from the archive.
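
To make the robots.txt mechanism concrete: the Internet Archive’s crawler identifies itself as ia_archiver, so a new owner of the domain who publishes a rule along these lines would end up hiding the old snapshots as well (a minimal sketch of the exclusion, nothing more):

    User-agent: ia_archiver
    Disallow: /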

And of course TinyURL URLs will be completely useless if the tinyurl.com site disappears.

10

Kieran Healy 02.06.07 at 2:40 am

Never mind the possibility of tinyurl going out of business — even if they stay around and *their* link may never expire, if the underlying page no longer exists, it’s tough luck too. They’re a pointer/redirection service, not an archiving service.

11

Roy Belmont 02.06.07 at 2:41 am

So entrepreneurs could and should create an archiving service, like the Wayback but specific and on demand, could be for a fee, or ad revenue-driven. Tiered solutions, with a kind of… yeah, like those cryogenic services that freeze your brain for near-forever! Yeah! Content cryogenics! Big money! Hey!

12

fhu 02.06.07 at 2:47 am

13

fhu 02.06.07 at 2:48 am

Actually, I meant http://www.furl.net/. Lets you keep the page as it is.

14

Kieran Healy 02.06.07 at 2:48 am

I think what you want, though you don’t know it yet, is WebCite.

Hmm, intriguing.

15

John Holbo 02.06.07 at 2:53 am

Thanks for the webcite link. That looks valuable. Kieran, I wasn’t actually thinking that tinyurl would be an archiving service. I was only wondering whether there is some way to take those monstrously long googlecache links and shorten them somewhat without risking breakage. But I guess the obvious answer is that there is no way that tinyurl could really have made a ‘never expires’ commitment except in the short to middle term, historically speaking. (I was asking sort of in a ‘is there some astonishing way that this is really possible?’ way.)

16

Erik Hetzner 02.06.07 at 2:58 am

Sorry, archive.org is going to be around for longer than any data you store for yourself, no matter how amazing your hd array. What if you get hit by a bus?

On the other hand, they are careful but not perfect, and they may lose some data.

Google’s cache, on the other hand, is absolutely useless for any archiving needs. It’s primarily a cache for them & only secondarily for you, and if a site goes away google forgets about it as soon as possible, because what use do they have linking to a site that no longer exists?

There is also: http://www.hanzoweb.com/

17

Kieran Healy 02.06.07 at 3:06 am

Oh I see. “Never” having the usual startup meaning, “At least until the next round of VC funding.”

18

Christopher M 02.06.07 at 3:08 am

Google cache files are not permanent — they try to delete them soon after the actual page goes away. That policy was a factor in last year’s court decision (Field v. Google, see around p.23) holding that their cache doesn’t violate copyright law: “…the copy of Web pages that Google stores in its cache is present for approximately 14 to 20 days….The Court thus concludes that Google makes “intermediate and temporary storage” of the material stored in its cache, within the meaning of the DMCA.”

19

Randolph Fritz 02.06.07 at 3:11 am

It depends how secure you want your copy to be; the only certain answer is to keep copies yourself, or to have a very large and reliable archive keep the copy. The usual meaning of “cache” in computing is a place for temporary local copies, so I wouldn’t be relying on Google’s cache.

Hunh. I wonder if there is an “archive” URN standard? A way to say “page found at this URL on this date and time”? That would, if it existed, be similar to a library catalog reference; publisher, date, serial number. Of course, you’d still have to actually store the document.

(pause to research)

Well… there’s an expired internet draft for something called a “DURI” – a dated URI. It was written by Larry Masinter of Adobe, but it doesn’t seem to be going anywhere. Getting it moving would probably be a good project for some library science degree candidate.
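
If memory of Masinter’s draft serves, a dated URI simply pairs a timestamp with an ordinary URI, roughly along these lines (illustrative only – the draft has the exact syntax):

    duri:2007-02-06:http://example.org/some/page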

20

John Holbo 02.06.07 at 3:18 am

I just got a very polite email from the tinyurl people – within 10 minutes of sending it – explaining that their links are good so long as they shall live. But obviously not longer. And they hope to live a long time. But now that I know googlecache is not a good way to go, I’m not sure I need the tinyurl option anyway. A combination of wayback and webcite looks like a decent level of protection against disappearance.

21

John Holbo 02.06.07 at 3:19 am

‘They’ meaning the tinyurl service, not the links. They didn’t just reply with a tautology.

22

Erik Hetzner 02.06.07 at 3:22 am

Randolph Fritz: It depends how secure you want your copy to be; the only certain answer is to keep copies yourself, or to have a very large and reliable archive keep the copy.

May I suggest again that keeping a copy for yourself is a good solution for your own archives, for your own notes, but not for reference in a publication? Your archive will almost certainly not outlast archive.org.

23

C 02.06.07 at 3:26 am

A service that doesn’t quite fit your use case, but is nonetheless handy for keeping around snapshots of webpages is Furl (now owned by Looksmart). With Furl, you can create your own snapshots of webpages, which will remain indefinitely on Furl’s servers. You can also download a zip file containing all of your stored content. I don’t think it gives you the ability to publicly surface a page you’ve previously cached, however, so it might not be of much use to you.

24

ogged 02.06.07 at 4:03 am

Suppose I wanted to save a link to a page and do my best to ensure that it stays good – for years and years, in principle, even if the page goes away.

Why not ask the Crooked Timber community to whip up a little script to save pages to Amazon S3? I’m not a programmer, but I expect this would be very easy, and the storage would be very cheap for you. I’d even guess that you could host a domain that points to your S3 storage (here’s someone doing just that) and make URLs that wouldn’t need to change. This would be massively useful to anyone who worries about linkrot.
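
A minimal sketch of what such a script might look like – assuming Python with the boto3 S3 client and AWS credentials already configured; the bucket name and key scheme are invented for illustration:

    # Fetch a page as it stands right now and store a dated copy in S3.
    # Bucket name and key layout are made up for this sketch.
    import datetime
    import urllib.request

    import boto3  # assumes the boto3 library is installed and configured

    def archive_page(url, bucket="my-linkrot-archive"):
        html = urllib.request.urlopen(url).read()
        stamp = datetime.date.today().isoformat()
        # e.g. key "example.org/some/page/2007-02-06"
        key = url.split("://", 1)[-1].rstrip("/") + "/" + stamp
        s3 = boto3.client("s3")
        s3.put_object(Bucket=bucket, Key=key, Body=html, ContentType="text/html")
        return key

    # archive_page("http://example.org/some/page")

Pointing a domain at the bucket, as in the linked example, would then give each saved copy a stable public URL.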

25

Eszter 02.06.07 at 4:22 am

I would use the Web Archive, as suggested above. But that solution assumes the page of interest exists on it.

Alternatively, you could create a pdf of the page and put it on your own site. What I might do in such a case is create a separate HTML page on which I link to the original, but then note that if that’s no longer available, the linked-to pdf below should work.
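
A bare-bones version of that pointer page might be nothing more than the following (filenames and URL are placeholders):

    <p>Original: <a href="http://example.org/some/page">example.org/some/page</a>
       (as retrieved 6 February 2007).</p>
    <p>If the original is no longer available, the
       <a href="archived-copy.pdf">archived PDF copy</a> should work.</p>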

26

Erik Hetzner 02.06.07 at 4:37 am

Eszter: on getting a site in archive.org, see:

27

Erik Hetzner 02.06.07 at 4:38 am

28

bi 02.06.07 at 5:10 am

It’s still a bit hit and miss. It can take anywhere up to 6 months before the site gets archived.

Or how about saving with Furl, then asking the Wayback Machine to archive the immutable Furl page? I’ve not tried that, so I don’t know if there’s any reason it may not work.

29

eszter 02.06.07 at 8:23 am

Thanks, Erik.

This linked page suggests eight weeks:
http://www.alexa.com/site/help/webmasters#crawl_site

“Alexa will include your site in the next crawl of the web, usually within 8 weeks of submission.”

Of course, that’s likely still too long to meet John’s current needs.

30

Johan 02.06.07 at 9:04 am

Would this technique suit your purposes? http://codev2.cc/links/

31

lw 02.06.07 at 2:30 pm

Many pages rendered by your browser do not exist in HTML form until they are rendered– they are the output of a query against somebody’s database run through formatting software.
Amazon, ebay, or flickr “pages” are like this, for instance, as are “pages” from services that display scientific data. Keeping your own copy or paying someone else to archive data whose format and organisation is your responsibility is really the only way to preserve this kind of data, ideally in cooperation with the current curator.

Data archives (libraries, among others) have considered the problem also; you’re not the first to ask this question.
When has a page changed?

32

abb1 02.06.07 at 3:03 pm

@31: In fact this very page is dynamic; the posts and comments are kept in a database and the final html page is created on the fly (based on a template) by PHP (most likely) or some other software invoked by the internet service. Thus what you really need to preserve is not so much individual pages, but the database. If you have the database, normally it shouldn’t be too difficult to figure out how to display the data. Well, I mean, if the application is a blog then it shouldn’t be too difficult.

Google doesn’t archive databases; it is itself a database. So, what’s needed here is a service that archives internet databases (including google’s). Though unfortunately they usually aren’t directly accessible.
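
A toy illustration of the point (schema and markup invented for the example): the “page” is a query run through a template at request time, so the HTML is disposable while the rows in the database are what actually persist.

    # Toy sketch: a blog "page" assembled on the fly from a database.
    import sqlite3

    def render_post(db_path, post_id):
        conn = sqlite3.connect(db_path)
        title, body = conn.execute(
            "SELECT title, body FROM posts WHERE id = ?", (post_id,)
        ).fetchone()
        comments = conn.execute(
            "SELECT author, text FROM comments WHERE post_id = ?", (post_id,)
        ).fetchall()
        conn.close()
        # The HTML below exists only for the duration of the request;
        # only the rows fetched above are actually stored anywhere.
        html = "<h1>%s</h1>\n<div>%s</div>\n" % (title, body)
        for author, text in comments:
            html += "<p><b>%s</b>: %s</p>\n" % (author, text)
        return html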

33

EthanJ 02.06.07 at 3:48 pm

Keep the page yourself, with the Scrapbook Firefox extension – it’s so much better than a stack of printouts.

You can save it, edit it, crop it, print it, search it, attach annotations, etc… Get a single page, or follow several links deep into a site. It will save the original URL for you, with internal link structure intact. If the page disappears, you’ve got your own copy – and I think that would qualify as “fair use” if the original is no longer available.

If you need your Scrapbook to be linkable for the public, put it on your site.

34

EthanJ 02.06.07 at 3:51 pm

Scrapbook

Won Best in Class: Most Useful (Upgraded)
in Mozilla’s official “Extend Firefox” contest

35

sanbikinoraion 02.06.07 at 4:14 pm

abb1: uh… no. Preserving the output medium is exactly the point, because that’s what you’re interested in directing your readers to. Ideally, you’d want to be able to point them at the state a page was in at a particular time in order to evade the author changing or removing the page.

If you think that reconstructing output pages solely from the database “shouldn’t be too difficult” then I suspect that you haven’t thought about the problem hard enough.

36

KCinDC 02.06.07 at 4:58 pm

Yes, the output is what’s wanted. It’s irrelevant whether it comes from a HTML file sitting on a disk (which could still be changing constantly) or whether it’s constructed on the fly from database queries or the pecking of trained pigeons. There’s no way to know for sure what’s going on behind the scenes, and there’s no reason to care in this case.

37

eszter 02.06.07 at 5:07 pm

Many pages rendered by your browser do not exist in HTML form [..] flickr “pages” are like this

Hmm.. can you elaborate? Flickr tends to have very stable URLs. Sure, results to a search may change and yes, comments can be added changing some content on the page, but generally speaking, a photo’s, a set’s or a group’s URL seems constant on Flickr. (The issue of people being able to replace photos on a page is a different one, I don’t think that’s what you were addressing here.)

38

Peter 02.06.07 at 5:17 pm

I used to keep my local copies on removable hard drives. This past halloween, my apartment was broken into, and my pile of usb drives was stolen (not the desktop pcs though), as well as other small things such as all my DVDs (almost all, they did leave Red Dawn, but no CDs were taken). Worst, the fireproof lockboxes in which I kept things like my birth cert (what! $50 for a replacement?), social security card (did you know that by law you are limited to 10 replacement cards in your lifetime?), passport (fortunately in the mail getting renewed, and woot, no chip) and car title were also stolen.

Some of those drives contained programming projects I had worked on at home for the past decade (some moved from older pcs that had been discarded), a couple drives contained several thousand PDFs of research that I had been collecting. I had also saved hundreds of web pages as PDF files.

39

abb1 02.06.07 at 5:20 pm

Nah, what you really care about is the data, not all the zillion snapshots of various presentations of this data.

40

atd 02.06.07 at 6:43 pm

Zotero does full snapshots, and plays super nice with other bibliographic programs.

http://www.zotero.org/

41

lw 02.06.07 at 6:50 pm

Flickr, like many services, generates base URLs in a way closely connected to the way underlying objects are stored in the db; in flickr’s case (as with most services), there’s a core object the page is associated with. I was actually thinking of added comments or edits, both of which change the content of the page displayed; capturing the human reasoning “the photo is key; same photo, same URL, everything OK” in a program is not easy. That intuitively stable mapping may not exist; consider the document http://del.icio.us/recent, or a vendor’s “you may also like” recommendations.
The core attribute linking the entries on that page is the result of a query against an ever-changing db. Or consider this very exchange– are comments not written by the entry’s author a part of this page or not? What if the interface was richer and links to threads with similar keywords were displayed– would those links be part of the page? What about stable URLs associated with an astronomical object that just blew up, or a gene for which there’s a newly reported functional assay? Same query, same URL, same “core” entity, but a qualitatively different page. These are not marginal cases– pages with a single definite human author and denumerable edits by that author are already rare.

Static documents are at least superficially indistinguishable from dynamic objects that are pretty different from them, making scope and synchrony questions essential for any internet archiving beyond saving a small amount of html locally. This leaves aside the unformalizable question of format/content, which is not one of HTML’s strengths.

On a local note– are the Shakespeare gardens behind Vogelback being kept up?

42

sfisher 02.06.07 at 6:59 pm

In reply to abb1 (#39): I think it depends on the situation as to whether you want to preserve a presentation of the data, or some kind of database/XML dump of all the data itself. An exact presentation may be the best thing for something like an academic paper reference in which you want to say “see page 39 of pdf xxx” and point to an exact spot that’s fairly stable for humans to look at and will be easily available for a while.

On the other hand, preserving the data for machines to look at and do gymnastics on is flexible since you can theoretically present it or crunch the data again in various ways if you have all the original information somewhere. But creating new views of data isn’t always cheap or trivial. It becomes even more expensive and more complicated if the original rendering software isn’t preserved along with the data (or if it becomes obsolete). At some point the cost of re-rendering from some specific data structure likely becomes higher than the known or suspected value of the data. Whereas a fixed-render in a standard format is likely to have pre-packaged software available to render it in a fixed and inflexible way, but at much lower cost longer into the future.

Likely you could archive both the original data and a fixed render, and then you have some flexibility as well as lower cost for previewing/seeing the contents.

In reply to Peter (#38): Yikes. That’s a horrible story. That makes me consider that I should have better back-up practices for off site storage. (Or perhaps do twice-yearly DVD back ups of my USB hard drive back-ups to store off site). Not that it would ruin my life if I lost all my data on my home computer, but it would probably be more unpleasant than doing occasional off-site archiving.

43

e-tat 02.06.07 at 7:25 pm

Yeah: Furl, Scrapbook, and Spurl are all good for making archive copies. With Furl and Spurl you also get to see who else has made a copy… so you can, like, ask to borrow theirs.

44

Adam Kotsko 02.06.07 at 7:33 pm

You should probably put the safe in a fallout shelter as well.

45

abb1 02.06.07 at 7:58 pm

Sfisher, sure, clearly if what you have is a bunch of documents – you need to archive the documents.

And it’s certainly true that archiving databases often isn’t simple.

I’m just saying that conceptually you’re probably better off with a copy of the database than with a bunch of reports based on this database.

46

Eszter 02.06.07 at 9:36 pm

LW – Thanks for clarifying what you meant. However, at that rate any page can change on the Web, of course. (I guess it would be hard to get a Web Archive page to change, but even archives of media sites change on their own domains as the sites go through entire revamps and post different ads, just to name an example.)

Sure, in that sense it’s all dynamic. But since you mentioned the Flickr example with the Amazon example, I thought you were focusing more on the pages that are so dynamic that they almost certainly would not come up the same way as a result to a search by someone else using another machine logged in to another profile (or no login at all). I was surprised you’d equate Flickr to that sort of dynamic rendering, that’s all.

As for the Shakespeare garden, I’m living in Palo Alto this year so I can’t say. If it wasn’t so cold I’d say I’ll check it out next week when I’m back briefly, but I don’t think that’s going to happen this time around.

47

Fr. 02.06.07 at 10:32 pm

1. Archived copy (with Refresh) – With Furl, you can save a copy of a web page and it is archived for you. This means that you can access that page and read it any time you need to, even if the web site is down, or the page has changed on the original web site, or even if the page is no longer accessible for free. In those cases where the web page has changed and you would rather have the new content, you can refresh the archived copy at will. [furl]

2. Full-page capture programmes, like Zotero. Some other browser plugins can also do it.

48

Bob Calder 02.07.07 at 12:49 am

I noticed the Internet Archive mentioned only once. There is a REASON they call it the Internet Archive. Really.

Of course there are issues. There are always issues.

49

joeo 02.07.07 at 1:52 am

WebCite looks pretty sweet.

Here is an example I just made:

http://www.webcitation.org/5MSlJEAAW

50

bemused 02.07.07 at 2:36 am

To everyone suggesting references to the wayback machine — as long as you are referring to content on a domain you control (and intend to control in future) you’re golden. But archive.org retroactively obeys the robots.txt directives it finds on current websites, and thus makes content inaccessible if the owner of a domain changes his/her mind later about whether the content should be visible.

Beware.

51

notme 02.07.07 at 3:27 am

Here’s a recent bibliography from a presentation titled “New Tools for Preserving Digital Collections”, given by Tracy Seneca of the California Digital Library at the Library & Information Technology Association’s 2006 Forum. Go here for the Forum’s main page.

Then there’s the January 2007 issue of Ariadne, with relevant articles and conference reports.

That information is oriented to organizations (such as libraries) seeking to archive web pages. But you are going to have many of the problems they’ve already spent a lot of time analyzing.

If you’re only interested in “pretty good” persistence for someone else’s document, and you want to be able to point people to it with a URL, then webcite seems to be the way to go. Assuming the pages aren’t already being archived by a national library.

52

stuart 02.07.07 at 12:32 pm

But archive.org retroactively obeys the robots.txt directives it finds on current websites, and thus makes content inaccessible if the owner of a domain changes his/her mind later about whether the content should be visible.

Well, especially as expired domains are now often picked up by porn redirect sites to get the extra hits from old links and favourites, so you can’t even rely on dead domains staying dead (and hence no new robots.txt to change the instructions) like in the past.

Comments on this entry are closed.