File Synchronization with rsync
Nov 22nd, 2004 by Aaron Louie
One of the major issues in maintaining a large web site with many authors is that of keeping everything up to date. In our system, authors edit HTML pages on a Windows file share, and those same files are served up on the web site. Unfortunately, this means that any changes the authors make are immediately reflected online, blemishes and all. While the ideal solution to this problem would be a content management system like Plone, a stop-gap is needed to provide some sort of staging service between the authoring server and the web server. The only problem is that our authors are accustomed to the Windows Explorer interface, and we’re trying to migrate our web servers to Linux. The solution? Synchronization with rsync. It slices, it dices, etc.
Unfortunately, rsync only gives you one way to drive it: the command line, optionally tunneling over SSH to reach a remote host. We’re syncing between two mounted file systems on a Linux box, so we don’t really need SSH. So that leaves us with the ugly command line. Fortunately, there’s a Perl module called File::Rsync. It basically wraps the ugly exec call in an easy(er)-to-use Perl API. It’s still ugly, and, yes, I know, it’s Perl. But it works. Kind of.
A big complaint about rsync is that, if you sync a gigantic file tree recursively, you risk running out of memory. Why? Because rsync builds the entire file list in RAM before it transfers anything, eating up about 100 bytes of memory per file or directory. Bad, bad, bad. Well, fear not, for I have written a Perl script that forces rsync to sync just one directory at a time, holding in memory only as much as is needed to navigate to the directory being synced. It’s not quite complete yet, but I’ll post links to the source when it is…
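To make the one-directory-at-a-time idea concrete, here is a minimal sketch in Python. It is not the Perl script described above, just an illustration under some assumptions: the function names are made up for this example, and it relies on an rsync build that supports the --dirs option (copy a directory's immediate contents without recursing). The flags -lptgoD are what -a expands to minus the -r.

```python
import os
import subprocess
from typing import Iterator, List


def per_directory_commands(src_root: str, dst_root: str) -> Iterator[List[str]]:
    """Yield one non-recursive rsync command per directory under src_root.

    Hypothetical helper for illustration; not part of rsync or File::Rsync.
    """
    # os.walk goes top-down and lazily, so a parent is always synced
    # (and its subdirectories created on the destination) before its children.
    for dirpath, dirnames, _filenames in os.walk(src_root):
        dirnames.sort()  # deterministic traversal order
        rel = os.path.relpath(dirpath, src_root)
        dst = dst_root if rel == "." else os.path.join(dst_root, rel)
        # Trailing slashes mean "the contents of this directory".
        # --dirs copies entries at this level only; no -r, so rsync never
        # builds a file list for the whole tree.
        yield [
            "rsync", "-lptgoD", "--dirs",
            os.path.join(dirpath, ""),
            os.path.join(dst, ""),
        ]


def sync_tree(src_root: str, dst_root: str) -> None:
    """Run the per-directory commands one after another."""
    for cmd in per_directory_commands(src_root, dst_root):
        subprocess.run(cmd, check=True)  # stop at the first failure
```

Calling sync_tree("/mnt/authoring/www", "/mnt/web/www") (both paths are placeholders) would launch one small rsync per directory. The trade-off is many more process launches in exchange for a file list that never grows beyond a single directory.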
[Update 11/29/2004]
Here’s the source. Please note that this was written for a very specific context and is only presented as an example of what one can do with rsync. It will not work for any domain outside the University of Washington Libraries.
Another interesting feature of this script — and perhaps the centerpiece of its functionality — is its ability to authorize users via a Java web service that goes by the imaginative name, The Authorization Project, or, unofficially, “Authorizator”. With our synchronization script, users are authenticated via the university’s central UW NetID web service, which uses UW’s own pubcookie. This just means that we already know who our users are — we just have to find out what they are allowed to do.
This is where Authorizator comes in. Since our network admin already manages the myriad of Windows users and their permissions on the Windows file share, the authorization information is already there. The Authorizator web service provides the glue that allows us, with the information submitted in an HTML form, to find out if someone is allowed to synchronize a particular directory. Authorizator takes an HTTP query (with username, directory/file path, and domain parameters) and returns permissions data in the form of XML. It’s not pretty, but it works for now.
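For a rough picture of what a client of such a service might look like, here is a sketch in Python. Everything specific in it is an assumption made for illustration: the base URL, the query parameter names (username, path, domain), and the XML layout (a permission element with name and allowed attributes) are all invented, since the real Authorizator interface isn't documented here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_query_url(base_url: str, username: str, path: str, domain: str) -> str:
    """Build the HTTP query for a permissions lookup.

    The parameter names are hypothetical; the real service may use others.
    """
    query = urlencode({"username": username, "path": path, "domain": domain})
    return f"{base_url}?{query}"


def may_synchronize(xml_text: str) -> bool:
    """Return True only if the XML reply grants write permission.

    Assumes a made-up reply format such as:
        <permissions>
          <permission name="write" allowed="true"/>
        </permissions>
    Anything else, including an empty reply, is treated as a denial.
    """
    root = ET.fromstring(xml_text)
    for perm in root.iter("permission"):
        if perm.get("name") == "write" and perm.get("allowed") == "true":
            return True
    return False
```

In use, the sync script would fetch the URL from build_query_url (for example with urllib.request.urlopen), pass the response body to may_synchronize, and refuse to run rsync unless it returns True. Denying by default means a malformed or unexpected reply never results in a sync.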