Separating a Large Repository

A few months ago, I posted an article about combining multiple Subversion repositories into one large repository. Some folks have expressed an interest in doing the opposite--separating one large repository into multiple smaller repositories. The process is not without its quirks, but it can be done.

At first glance, you'd conclude the process would work much the same way: Loop through the individual directories in the large repository, create smaller repositories for each one, then dump and import the contents of each directory into its small repository.

The tricky part is that the Subversion dump command dumps everything in the repository, by revision. In order to pull just a single directory, you must filter a complete dump with the "svndumpfilter" command. This blog post by AllMyBrain.com explains how to accomplish this in Linux. I usually have to work on a Windows box on the job, so I wrote a Windows batch script that does the same thing.

The strategy is the same as the Linux script, though. We're going to run "svnadmin dump" on the large repository, then use "svndumpfilter" to keep just the directory we want, then "svnadmin load" the results into the newly created repository. All of this can be combined into a single statement via piping:

DOS:
  svnadmin dump c:\my\large\repo\ | svndumpfilter include MyDirectory | svnadmin load MySmallRepo\MyDirectory

This will make a little more sense when we look at the full script. Let's just put it out there and then go through it.

DOS:
  REM Where the small repositories go, and where the big one lives.
  SET SmallRepoPath=c:\SmallRepos
  SET PathToRepo=c:\BigRepo
  SET UNCToRepo=file:///c:/BigRepo
  SET PathToChkout=c:\BigRepoChkout

  REM Check out the big repository only to list its top-level directories.
  mkdir %PathToChkout%
  svn co %UNCToRepo% %PathToChkout% --ignore-externals
  dir /A:D /B %PathToChkout%> %PathToChkout%\dirs.tmp
  REM "delims=" keeps directory names that contain spaces in one piece.
  for /F "delims=" %%i in (%PathToChkout%\dirs.tmp) do (
      if not "%%i"==".svn" (
          echo Processing "%%i"...
          mkdir "%SmallRepoPath%\%%i"
          svnadmin create "%SmallRepoPath%\%%i"
          svnadmin dump %PathToRepo% | svndumpfilter include "%%i" | svnadmin load "%SmallRepoPath%\%%i"
      )
  )
  REM Clean up the temp file and the checkout.
  del %PathToChkout%\dirs.tmp
  rmdir /S /Q %PathToChkout%

First, we're setting our paths. "SmallRepoPath" will be the directory holding all of the small repositories we'll be creating. "PathToRepo" and "UNCToRepo" point to the big repository as a Windows path and a file:/// URL, respectively (despite its name, the second one is not really a UNC path). "PathToChkout" points to a Subversion checkout of the large repository.

Next, we check out the large repository with the "svn co" command. We do this only so that we can run "dir /A:D /B", which lists just the directories in the checkout directory, one name per line. We use that output to loop through each top-level directory in the large repository, skipping the hidden .svn directory.
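If checking out the whole repository just to get a directory listing feels heavy, "svn list" can supply the same names straight from the repository. This is a sketch of a swapped-in technique, not what the script above does. It assumes the same UNCToRepo variable, and it relies on "svn list" printing directory entries with a trailing slash.

```batch
@echo off
REM Sketch only: list top-level directories without a working copy.
REM UNCToRepo is assumed to point at the big repository, as in the script above.
SET UNCToRepo=file:///c:/BigRepo

REM svn list prints directories with a trailing "/". findstr keeps only the
REM lines that end in "/", and "delims=/" strips the slash from each name.
for /F "delims=/" %%i in ('svn list %UNCToRepo% ^| findstr /E /C:"/"') do (
    echo Found directory "%%i"
)
```

The echo line is where the create/dump/filter/load steps would go. Files sitting at the top level of the repository are skipped because they have no trailing slash.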

Then, for each directory in the large repository, we create a corresponding small repository, then do our dump/filter/load combo. Again, we're dumping the contents of the large repository, using "svndumpfilter" to filter by directory, then loading that filtered dump into the new small repository.

Finally, we just do some cleanup by removing our temp files and the checkout directory.

There are a few caveats with this code.

First, it carries every one of the large repository's revision numbers into the smaller repository. There are svndumpfilter arguments to prevent this, --drop-empty-revs and --renumber-revs, but I found the Windows Subversion binaries to be problematic with these arguments. The end result is that you have more revision numbers than needed, with the revisions that never touched the directory left empty. Only the relevant data is actually imported into the repository, and the log for the imported directory still shows just the revisions related to that directory, so there's really little harm done.
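For reference, this is where those two arguments attach to the filter step. Given the trouble I had with them on Windows, treat it as something to try against a scratch copy, not a tested recipe. The paths are the placeholder ones from the script above, and MyDirectory is a hypothetical directory name.

```batch
REM --drop-empty-revs omits revisions that the filter left empty.
REM --renumber-revs renumbers the revisions that remain.
svnadmin dump c:\BigRepo | svndumpfilter include MyDirectory --drop-empty-revs --renumber-revs | svnadmin load c:\SmallRepos\MyDirectory
```

If it works with your binaries, the small repository ends up with only the revisions that touched MyDirectory, numbered from the start.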

Second, the dump/filter/load action doesn't always work on a directory that has been moved (copied/deleted) from another location within the large repository. What's worse, it won't fail, it just won't load any data into the small repository. To address this, use the --revision argument on the "svnadmin dump" command to do a dump starting at a revision after the move took place. Doing so will give the "svndumpfilter" command something it can work with.
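As a sketch of that workaround, here is the same one-liner with a starting revision. The number 250 is hypothetical; substitute the first revision after the move in your own repository. The paths and MyDirectory are placeholders as before.

```batch
REM 250 is a hypothetical revision number: the first revision after the move.
svnadmin dump -r 250:HEAD c:\BigRepo | svndumpfilter include MyDirectory | svnadmin load c:\SmallRepos\MyDirectory
```

Unless you add --incremental, the first revision in a ranged dump is a full snapshot of the tree at that revision, so the directory's contents arrive intact. The history before revision 250 is not carried over.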

This process is certainly more complicated to explain, but ultimately there's not that much more going on. Hopefully this explanation is helpful to you.

Combining Repositories Into One Large Repository

I keep all my projects in separate Subversion repositories. I did this because it feels a lot cleaner this way, there is less risk in the event of repository corruption, and I use corresponding Trac projects that I also wanted to keep separate from one project to the next.

That said, there are advantages to having one single repository. No big deal, that can be done after the fact with code.

Here is some Windows code to combine all the repositories in a directory into a single big repository:

DOS:
  set svndir=c:\Test\svn
  set bigrepo=c:\Test\BigRepo
  set bigrepoUNC=file:///c:/Test/BigRepo
  set rev=0:HEAD

  echo Setting up the big repository.
  REM Start from scratch: delete any previous big repository.
  if exist %bigrepo% rmdir /S /Q %bigrepo%
  mkdir %bigrepo%
  svnadmin create %bigrepo%

  REM /d also switches drives if svndir is on another one.
  cd /d %svndir%
  dir /A:D /B> dirs.tmp
  for /F %%i in (dirs.tmp) do (
      echo Adding %svndir%\%%i to the big repository:
      REM Dump one small repository, make its directory, load the dump into it.
      svnadmin dump -r %rev% %%i > %%i.dmp
      svn mkdir -m "Making project directory %%i." --non-interactive %bigrepoUNC%/%%i
      svnadmin load %bigrepo% --parent-dir %%i < %%i.dmp
      del /F /Q %%i.dmp
  )
  del dirs.tmp

There's really not much happening here; the process is simple. First, we create the new "big" repository with the svnadmin create statement. Second, we loop through the directory, processing each Subversion repository in the directory with a three-step process: (a) Dump the repository with the svnadmin dump statement into a temporary *.dmp file. (b) Explicitly add a new directory in the "big" repository for the current repository we're processing, with the svn mkdir statement. (c) Import the dump into the "big" repository with the svnadmin load statement. Really, the rest of the code is just looping, commenting, or cleanup code.

What have we produced? As you might expect, we now have one big repository that has all of the files and commits that were in all of the smaller repositories. The big repository will maintain its own revision numbering, so the revision numbers in your smaller repositories will not match the big repository's revision numbering, although the original commit dates will be preserved. This can be really handy for searching or similar actions that you might do from a more global perspective.

However, this approach is not without its caveats. During the import process, one entire repository is imported at a time. All of a particular repository's revisions will be "grouped" together in the big repository. As a result, revision numbers in the big repository will change every time you recreate it, if there was any new activity in the repositories it contains. For instance, revision #1050 in the big repository may parallel revision #500 in Repository X, but if a commit was added to a repository that is imported before it and the big repository is recreated, that revision would now be #1051. Additionally, although all history and dates are preserved in the revisions, the big repository has commits that are not in chronological order since the import was processed by repository. This inconsistent date/commit ordering can be confusing to some repository reporting tools and may actually render those tools useless to you when they are reporting by date.

Filtering by revision. Note that my svnadmin dump statement includes the -r argument, which specifies the beginning and ending revisions to dump. By default, I'm using "0:HEAD", which basically means "dump every revision", or "dump from the first revision to the HEAD, or latest, revision". Changing the beginning and ending revisions can be useful, especially when used with dates instead of actual revision numbers. For instance, you could change the value to {2007-01-01}:{2007-12-31} to only dump revisions that were committed in 2007.
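In the script, that is a one-line change. The second command shows the same range applied to a single repository by hand; ProjectX and the dump file name are hypothetical.

```batch
REM Only the revision range changes; the rest of the script stays the same.
set rev={2007-01-01}:{2007-12-31}

REM The same range on one repository. ProjectX is a hypothetical name.
svnadmin dump -r %rev% c:\Test\svn\ProjectX > ProjectX-2007.dmp
```

Keep in mind that when a dump starts partway through the history, its first revision is a full snapshot of the repository at that point unless you pass --incremental.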

Combining all of your smaller repositories into one big repository after the fact isn't a perfect solution, but it can be handy, and it's really easy to do when you have a script like this ready to run.

Why That Batch For Loop Isn’t Working

Time for another fun foray into Windows batch scripts. Perhaps you've used the FOR /F command to loop through the contents of a file (for instance, perhaps some data that was redirected to a text file from a command). Grab a line, act on its values, and output some text and commands.

Let's set this up. First, we have a data file named SomeAccounts.txt:

Josh
Mary
Suzy
Amanda
Trisha
Ben

Then, we have ProcessAccounts.bat, which we want to just loop through the accounts in the text file, tell us what they are, and tell us the first letter of the account name (just to have something to do):

DOS:
  set file=SomeAccounts.txt
  FOR /F %%i IN (%file%) DO (
      set username=%%i
      echo My account, %username%, starts with %username:~0,1%.
  )

Except when you do this, you encounter a problem: All of the values from the FOR loop are the same! It's as if the for loop ran the proper number of times, but it just ran on the last record over and over again! See below:

My account, Ben, starts with B.
My account, Ben, starts with B.
My account, Ben, starts with B.
My account, Ben, starts with B.
My account, Ben, starts with B.
My account, Ben, starts with B.

What's actually happening is that the FOR loop does run over every line, and does set the variable each time, but cmd.exe expands %username% only once, when it reads the whole parenthesized block, before the first iteration runs. Every echo therefore prints the value username held before the loop started; on a repeat run in the same console window, that is the "Ben" left behind by the previous run. This wouldn't be a problem if you were just using your FOR parameter, in this case %%i, but any variable you set inside the FOR loop, like username, and then read with percent signs is frozen this way.

The fix is simple enough, if you know about it! But I've found the solution to be a bit elusive, which is the whole point of sharing it now.

The key is the setlocal EnableDelayedExpansion command. As explained at ss64.com, making this statement before your FOR loop will enable you to display variables as their value at the moment you're referencing them, or their "intermediate values" while in the middle of the FOR loop. In addition to calling the setlocal command, you then have to reference your variables with the exclamation point (!) rather than percent (%) to indicate that you want to use the intermediate value.

Your script will then look like this:

DOS:
  setlocal EnableDelayedExpansion
  set file=SomeAccounts.txt
  FOR /F %%i IN (%file%) DO (
      set username=%%i
      echo My account, !username!, starts with !username:~0,1!.
  )

It will now happily act as desired, outputting these results:

My account, Josh, starts with J.
My account, Mary, starts with M.
My account, Suzy, starts with S.
My account, Amanda, starts with A.
My account, Trisha, starts with T.
My account, Ben, starts with B.
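To see both forms side by side, here is a minimal sketch with a counter. The variable name count and the loop values are just for illustration.

```batch
@echo off
setlocal EnableDelayedExpansion
set count=0
for %%n in (a b c) do (
    set /a count+=1
    REM %count% was filled in when the block was read; !count! is read now.
    echo percent: %count%  delayed: !count!
)
REM Prints:
REM   percent: 0  delayed: 1
REM   percent: 0  delayed: 2
REM   percent: 0  delayed: 3
```

The percent form stays at 0 because it was substituted before the loop started, while the exclamation form follows the variable as it changes.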

Free Command-Line Zip on Windows

Both Linux and Mac OS X have zip, gzip, and bzip2 command-line tools. What about Windows? If you're trying to do some scripting to automate some archiving or backup, and you want it to be a classic, WinZip-compatible .zip file, how can you do it?

WinZip offers a WinZip Command Line Add-on free of charge--if you already own a copy of WinZip Pro!

You shouldn't have to pay for command-line zip. And you don't have to. Enter Info-ZIP. This workgroup has been maintaining free, portable, high-quality versions of zip and unzip. They have plenty of command-line arguments like you would expect from an open source project.

So, with this project's executables in your system path, you can write up a batch file that is executed as a Windows scheduled task. Maybe something like this:

DOS:
  zip -q -S -r c:\path\MyBackup.zip c:\data -i@include.lst

This will zip the c:\data directory. Arguments: -q to do it quietly, -S to include system and hidden files, -r to recurse into subdirectories. Finally, use -i to point to a file that lists the patterns of the files to include, one per line.
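As a sketch of what that list file might hold, the batch file could even write it before calling zip. The patterns here are hypothetical; use whatever matches the files you want to keep.

```batch
@echo off
REM Build a hypothetical include list, one pattern per line.
echo *.doc> include.lst
echo *.xls>> include.lst

REM Zip only the files under c:\data that match the patterns in include.lst.
zip -q -S -r c:\path\MyBackup.zip c:\data -i@include.lst
```

With -r, each pattern is checked against files in the subdirectories as well, so *.doc picks up Word documents at any depth.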

You can alternatively use -x to specify only which files should be excluded. Perhaps something like this:

DOS:
  zip -q -S -r c:\path\MyBackup.zip c:\data -x@exclude.lst

The command-line flags are all optional, of course. This tool is certainly a must-have for the Windows scripter.
