Separating a Large Repository

A few months ago, I posted an article about combining multiple Subversion repositories into one large repository. Some folks have expressed an interest in doing the opposite--separating one large repository into multiple smaller repositories. The process is not without its quirks, but it can be done.

At first glance, you'd conclude the process would work much the same way: Loop through the individual directories in the large repository, create smaller repositories for each one, then dump and import the contents of each directory into its small repository.

The tricky part is that the "svnadmin dump" command dumps everything in the repository, by revision. In order to pull just a single directory, you must filter a complete dump with the "svndumpfilter" command. This blog post by AllMyBrain.com basically explains how to accomplish this in Linux. I usually have to work on a Windows box on the job, so I wrote a Windows batch script to do the same thing.

The strategy is the same as the Linux script, though. We're going to use "svnadmin dump" to dump the large repository, then use "svndumpfilter" to filter by just the directory we want, then "svnadmin load" the results into the newly created repository. All of this can be combined into a single statement via piping:

DOS:
  svnadmin dump c:\my\large\repo\ | svndumpfilter include MyDirectory | svnadmin load MySmallRepo\MyDirectory

This will make a little more sense when we look at the full script. Let's just put it out there and then go through it.

DOS:
  REM Where the small repositories go, and where the big one lives.
  SET SmallRepoPath=c:\SmallRepos
  SET PathToRepo=c:\BigRepo
  SET UNCToRepo=file:///c:/BigRepo
  SET PathToChkout=c:\BigRepoChkout

  REM Check out the big repository so we can list its top-level directories.
  mkdir %PathToChkout%
  svn co %UNCToRepo% %PathToChkout% --ignore-externals
  dir /A:D /B %PathToChkout% > %PathToChkout%\dirs.tmp

  REM Create one small repository per directory, then dump/filter/load into it.
  REM "delims=" keeps directory names that contain spaces in one piece.
  for /F "delims=" %%i in (%PathToChkout%\dirs.tmp) do (
      if not "%%i"==".svn" (
          echo Processing "%%i"...
          mkdir "%SmallRepoPath%\%%i"
          svnadmin create "%SmallRepoPath%\%%i"
          svnadmin dump %PathToRepo% | svndumpfilter include "%%i" | svnadmin load "%SmallRepoPath%\%%i"
      )
  )

  REM Clean up the temp file and the checkout.
  del %PathToChkout%\dirs.tmp
  rmdir /S /Q %PathToChkout%

First, we're setting our paths. "SmallRepoPath" will be the directory holding all of the small repositories we'll be creating. "PathToRepo" and "UNCToRepo" point to the big repository as a Windows path and a file:// URL, respectively (despite the variable name, the second is a URL rather than a true UNC path). "PathToChkout" points to a Subversion checkout of the large repository.

Next, we check out the large repository with the "svn co" command. We do this just so that we can call the "dir /A:D /B" command, which says, "List just the directories in the checkout directory." We use that output to loop through each directory in the large repository, skipping the ".svn" bookkeeping directory.
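If you're on a box with a POSIX shell, the checkout can probably be skipped, because "svn list" prints the top-level entries of a repository URL directly, with a trailing slash on directories. Here's a rough sketch of that idea. The function names only_dirs and list_top_level_dirs are mine, and the URL you pass in is whatever points at your own repository.

```shell
#!/bin/sh
# Keep only the directory entries from `svn list` output (they end in "/")
# and strip the trailing slash. Reads stdin, writes stdout.
only_dirs() {
    sed -n 's|/$||p'
}

# Hypothetical wrapper: list the top-level directories of a repository URL.
# Requires the svn client; only defined here, not called.
list_top_level_dirs() {
    svn list "$1" | only_dirs
}
```

Something like list_top_level_dirs file:///c:/BigRepo would then feed the loop instead of the dirs.tmp file.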

Then, for each directory in the large repository, we create a corresponding small repository, then do our dump/filter/load combo. Again, we're dumping the contents of the large repository, using "svndumpfilter" to filter by directory, then loading that filtered dump into the new small repository.

Finally, we just do some cleanup by removing our temp files and the checkout directory.

There are a few caveats with this code.

First, it will import all of the large repository's revisions into the smaller repository. There are svndumpfilter arguments to prevent this, such as --drop-empty-revs and --renumber-revs, but I found the Windows Subversion binaries to be problematic with these arguments. The end result is that you have more revision numbers than needed, but only the relevant data is actually imported into the repository, and viewing logs on just the imported directory will still obviously show revision logs related to that directory, so there's really little harm done.
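For what it's worth, here is roughly how those two svndumpfilter options would slot into the pipeline in a POSIX shell, where I'd expect them to behave. Treat it as a sketch I haven't run against your data: split_dir is a made-up helper name, and the paths in the usage line are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: split one top-level directory out of a big
# repository, dropping revisions that did not touch that directory
# and renumbering the ones that remain.
split_dir() {
    big_repo=$1     # path to the large repository
    dir=$2          # top-level directory to keep
    small_repo=$3   # path for the new, smaller repository

    svnadmin create "$small_repo" || return 1
    svnadmin dump "$big_repo" |
        svndumpfilter include --drop-empty-revs --renumber-revs "$dir" |
        svnadmin load "$small_repo"
}
```

Usage would look like: split_dir /srv/svn/BigRepo MyDirectory /srv/svn/small/MyDirectory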

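Second, the dump/filter/load action doesn't always work on a directory that has been moved (copied/deleted) from another location within the large repository. What's worse, it won't fail, it just won't load any data into the small repository. To address this, use the --revision argument on the "svnadmin dump" command (for example, -r 1234:HEAD, where 1234 is a placeholder) to start the dump at a revision after the move took place. Unless you also pass --incremental, the first revision in such a dump contains the full tree as it stood at that revision, so the moved directory shows up as ordinary added files rather than as a copy from a path the filter has thrown away. That gives the "svndumpfilter" command something it can work with.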
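Since the failure is silent, it may be worth checking each small repository after the load. One way, sketched here in POSIX shell, is to ask "svnlook tree" what is in the repository: an empty one prints only the root entry "/". The helper name repo_has_no_files is my own invention.

```shell
#!/bin/sh
# Hypothetical check: succeeds (exit status 0) when the repository's
# latest tree contains nothing but the root directory.
repo_has_no_files() {
    # `svnlook tree` prints "/" first, then one line per file or directory.
    lines=$(svnlook tree "$1" | wc -l | tr -d ' ')
    [ "$lines" -le 1 ]
}
```

If it reports an empty repository for a directory you know has content, that's the cue to recreate the small repository and rerun the dump with a --revision range that starts after the move.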

This process is certainly more complicated to explain, but ultimately there's not that much more going on. Hopefully this explanation is helpful to you.

Avoiding the Password Prompt for SSH

It's handy to establish an SSH key between machines so that SSH-related commands don't prompt you for a password. It makes it quick to SSH into another machine, and it's even more useful when running SSH commands in automated scripts. For instance, you may want to execute some rsync statements in a script that runs on a regular basis. It's better to have an established SSH key between the two machines than to have a password embedded in the script.

I recently reinstalled the OS on one of my Macs, and I've got backup scripts on my CentOS Linux box that use rsync to back up some pertinent data, so I had to reestablish the SSH key between the machines and had a hard time remembering how to do it. So this time I'm documenting what I had to re-learn.

First of all, there's a great post over at nixCraft that basically explains how to do it. But allow me to explain more thoroughly, ahem, dumbed down to my level. 

The key is remembering which machine is filling which role when you're reading the instructions. I'll call them the "Acting" machine--the one who is taking action and performing a command, let's say an rsync command--and the "Target" machine--the one who is being acted upon. In my case, the Linux server is the acting machine performing the rsync command, and my Mac is the target.

The process is simple. On the "Acting" machine, generate a key pair, and then give the public half of that key to the "Target" machine. Installing the public key on the "Target" effectively gives the holder of the matching private key "permission" to log in without being prompted for a password.

So, from the "Acting" machine, in this case, my Linux server, type the following command:

ssh-keygen -t rsa

This will generate a couple of files that serve as the key: a private key (id_rsa), which never leaves the "Acting" machine, and a public key (id_rsa.pub). The ssh-keygen command will ask you where to store the key and what passphrase to use. Just hit enter to use the default path and an empty passphrase. Keep in mind that with an empty passphrase, anyone who can read the private key file can log in as you, so keep that file protected.

Next, still from the "Acting" machine (my Linux server), type:

ssh MyUsername@TargetMachine "mkdir -p .ssh"
scp ~/.ssh/id_rsa.pub MyUsername@TargetMachine:.ssh/authorized_keys

In the code above, MyUsername@TargetMachine would be the username and address (for instance, perhaps the IP address) of the "Target" machine, in my case, the Mac. In the first line, you're just creating the .ssh directory if it doesn't exist. In the second line, you're copying the public key you generated from the "Acting" machine to the "Target" machine, or from the Linux server to the Mac. Note that both commands will ask for the password to the MyUsername account, because the key isn't in place yet. Also note that the scp command replaces any authorized_keys file already on the "Target", so only use it as written when no other keys are installed there.

Voilà. The "Acting" machine should now be able to SSH into the "Target" machine without a password prompt. Correspondingly, you should be able to perform rsync and other SSH commands without a password prompt. Please note, however, that this is only a one-way key. We only gave my Linux server permission to access my Mac.

What if I want my Mac to similarly log in to the server without a password prompt? In that case, the Mac and the server have effectively switched roles; the Mac is now the "Acting" machine and the server is the "Target" machine, so we just have to repeat the process from the other direction. Generate a key on the Mac and send its public key to the server. At that point, both machines will be able to access each other without a password prompt.

What if I have multiple "Targets" that the "Acting" machine will connect to? For instance, perhaps I have multiple Macs, and the Linux server is running scripts on all of them. The same public key can be installed on every target, so just repeat the two commands above for each one. The case to watch out for is the reverse: several "Acting" machines connecting to one "Target". SSH reads the keys from the authorized_keys file, one public key per line, so don't overwrite that file with each new key, and don't send the keys as separately numbered files such as "authorized_keys3", which will be ignored. Append each additional key to the existing file instead.
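When the machine receiving a key may already have other keys installed, appending is safer than copying a file over the top. Here's a rough sketch of a helper that adds a public key to an authorized_keys file only if that exact line isn't there already. The name append_key is my own, not a standard tool, and the file paths are whatever applies on your machine.

```shell
#!/bin/sh
# Hypothetical helper: append a public key to an authorized_keys file
# unless an identical line is already present.
append_key() {
    pubkey_file=$1   # e.g. a copied-over id_rsa.pub
    auth_file=$2     # e.g. ~/.ssh/authorized_keys

    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    key=$(cat "$pubkey_file")
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    if ! grep -qxF -e "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
    # sshd is picky about permissions on this file.
    chmod 600 "$auth_file"
}
```

After copying the public key over with scp, you might run append_key /tmp/id_rsa.pub ~/.ssh/authorized_keys on the receiving machine. Running it twice with the same key leaves a single entry.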

Hopefully this will clear up some confusion regarding this process.

Combining Repositories Into One Large Repository

I keep all my projects in separate Subversion repositories. I did this because it feels a lot cleaner this way, there is less risk in the event of repository corruption, and I use corresponding Trac projects that I also wanted to keep separate from one project to the next.

That said, there are advantages to having one single repository. No big deal, that can be done after the fact with code.

Here is some Windows code to combine all the repositories in a directory into a single big repository:

DOS:
  REM Directory holding the small repositories, and the big repository to build.
  set svndir=c:\Test\svn
  set bigrepo=c:\Test\BigRepo
  set bigrepoUNC=file:///c:/Test/BigRepo
  set rev=0:HEAD

  echo Setting up the big repository.
  REM Warning: this deletes any existing big repository before recreating it.
  rmdir /S /Q %bigrepo%
  mkdir %bigrepo%
  svnadmin create %bigrepo%

  REM /D also switches drives if svndir is on a different one.
  cd /D %svndir%
  dir /A:D /B > dirs.tmp
  for /F "delims=" %%i in (dirs.tmp) do (
      echo Adding %svndir%\%%i to the big repository:
      svnadmin dump -r %rev% "%%i" > "%%i.dmp"
      svn mkdir -m "Making project directory %%i." --non-interactive "%bigrepoUNC%/%%i"
      svnadmin load %bigrepo% --parent-dir "%%i" < "%%i.dmp"
      del /F /Q "%%i.dmp"
  )
  del dirs.tmp

There's really not much happening here; the process is simple. First, we create the new "big" repository with the svnadmin create statement. Second, we loop through the directory, processing each Subversion repository in the directory with a three-step process: (a) Dump the repository with the svnadmin dump statement into a temporary *.dmp file. (b) Explicitly add a new directory in the "big" repository for the current repository we're processing, with the svn mkdir statement. (c) Import the dump into the "big" repository with the svnadmin load statement. Really, the rest of the code is just looping, commenting, or cleanup code.

What have we produced? As you might expect, we now have one big repository that has all of the files and commits that were in all of the smaller repositories. The big repository will maintain its own revision numbering, so the revision numbers in your smaller repositories will not match the big repository's revision numbering, although the original commit dates will be preserved. This can be really handy for searching or similar actions that you might do from a more global perspective.

However, this approach is not without its caveats. During the import process, one entire repository is imported at a time. All of a particular repository's revisions will be "grouped" together in the big repository. As a result, revision numbers in the big repository will change every time you recreate it, if there was any new activity in the repositories it contains. For instance, revision #1050 in the big repository may parallel revision #500 in Repository X, but if a commit was added to a repository that is imported before it and the big repository is recreated, that revision would now be #1051. Additionally, although all history and dates are preserved in the revisions, the big repository has commits that are not in chronological order since the import was processed by repository. This inconsistent date/commit ordering can be confusing to some repository reporting tools and may actually render those tools useless to you when they are reporting by date.
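To make the renumbering concrete, here's a small back-of-the-envelope calculation in shell. It assumes the big repository starts empty, that each "svn mkdir" adds exactly one revision, and that every revision from 1 to HEAD of a small repository loads as exactly one revision in the big one. The function name big_rev and the revision counts are made up for illustration.

```shell
#!/bin/sh
# Hypothetical calculator: where does revision SMALL_REV of a repository
# land in the big repository, given the revision counts of the
# repositories imported before it?
# usage: big_rev SMALL_REV [EARLIER_REPO_REV_COUNT ...]
big_rev() {
    small_rev=$1
    shift
    offset=1    # this project's own "svn mkdir" commit
    for count in "$@"; do
        # an earlier repository's revisions, plus its "svn mkdir" commit
        offset=$((offset + count + 1))
    done
    echo $((offset + small_rev))
}

# Two earlier repositories with 300 and 247 revisions:
big_rev 500 300 247    # prints 1050
# One more commit lands in the first of them, and the big repo is rebuilt:
big_rev 500 301 247    # prints 1051
```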

Filtering by revision. Note that my svnadmin dump statement includes the -r argument, which specifies the beginning and ending revisions to dump. By default, I'm using "0:HEAD", which basically means "dump every revision", or "dump from the first revision to the HEAD, or latest, revision". Changing the beginning and ending revisions can be useful, especially when used with dates instead of actual revision numbers. For instance, you could change the value to {2007-01-01}:{2007-12-31} to dump revisions that were committed in 2007. One thing to watch: a bare date means midnight at the start of that day, so {2007-12-31} stops before any commits made on December 31. Use {2008-01-01} as the end of the range if you want that last day included.

Combining all of your smaller repositories into one big repository after the fact isn't a perfect solution, but it can be handy, and it's really easy to do when you have a script like this ready to run.

How to Fix 301 Error for Subversion Checkouts

My Linux box was hosting Subversion with no problem. I added a new repository to the several that were already present, and when I checked it out, it said, "301 Moved Permanently". Excuse me?

As it turns out, there is a 301 error section in the Subversion FAQs. It says that this typically means your Apache configuration is invalid (nope, the rest of my repositories worked just fine) or your repository has the same path as a literal directory on your web root. Ahhh!

Sure enough, my Subversion path was http://myserver.com/xyz/, and I had a literal directory named "xyz" in the web root. I renamed that directory, and Subversion then let me check out the repository with no problem.
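If you want to check for this kind of collision before adding a repository, a test along these lines would do it. This is only a sketch: path_collides is a name I made up, and you'd need to supply your own web root, since where it lives depends on your Apache setup.

```shell
#!/bin/sh
# Hypothetical check: succeeds (exit status 0) when the repository's URL
# path also exists as a real file or directory under the web root.
path_collides() {
    web_root=$1   # your Apache document root (location varies by system)
    url_path=$2   # e.g. "xyz" for http://myserver.com/xyz/
    [ -e "$web_root/$url_path" ]
}
```

For example, path_collides /path/to/webroot xyz && echo "rename the directory or the repository path".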
