Separating a Large Repository

A few months ago, I posted an article about combining multiple Subversion repositories into one large repository. Some folks have expressed an interest in doing the opposite--separating one large repository into multiple smaller repositories. The process is not without its quirks, but it can be done.

At first glance, you'd conclude the process would work much the same way: Loop through the individual directories in the large repository, create smaller repositories for each one, then dump and import the contents of each directory into its small repository.

The tricky part is that "svnadmin dump" dumps everything in the repository, revision by revision. To pull out just a single directory, you must filter a complete dump with the "svndumpfilter" command. This blog post by AllMyBrain.com explains how to accomplish this in Linux. I usually have to work on a Windows box on the job, so I wrote a Windows batch script to do the same thing.

The strategy is the same as the Linux script, though. We're going to use "svnadmin dump" to dump the large repository, then "svndumpfilter" to keep just the directory we want, then "svnadmin load" to load the result into the newly created repository. All of this can be combined into a single statement via piping:

DOS:
  svnadmin dump c:\my\large\repo\ | svndumpfilter include MyDirectory | svnadmin load MySmallRepo\MyDirectory

This will make a little more sense when we look at the full script. Let's just put it out there and then go through it.

DOS:
  REM Where the small repositories will be created.
  SET SmallRepoPath=c:\SmallRepos
  REM The big repository, as a file system path and as a file:// URL.
  SET PathToRepo=c:\BigRepo
  SET UNCToRepo=file:///c:/BigRepo
  REM Temporary checkout, used only to list the top-level directories.
  SET PathToChkout=c:\BigRepoChkout

  mkdir %PathToChkout%
  svn co %UNCToRepo% %PathToChkout% --ignore-externals
  dir /A:D /B %PathToChkout%> %PathToChkout%\dirs.tmp
  REM "delims=" keeps directory names that contain spaces in one piece.
  for /F "delims=" %%i in (%PathToChkout%\dirs.tmp) do (
      if not "%%i"==".svn" (
          echo Processing "%%i"...
          mkdir "%SmallRepoPath%\%%i"
          svnadmin create "%SmallRepoPath%\%%i"
          svnadmin dump %PathToRepo% | svndumpfilter include "%%i" | svnadmin load "%SmallRepoPath%\%%i"
      )
  )
  del %PathToChkout%\dirs.tmp
  rmdir /S /Q %PathToChkout%

We start by setting our paths. "SmallRepoPath" will be the directory holding all of the small repositories we'll be creating. "PathToRepo" and "UNCToRepo" point to the big repository as a Windows file system path and a file:// URL, respectively (despite the variable name, the second is a URL, not a UNC path). "PathToChkout" is where a Subversion checkout of the large repository will be placed.

Next, we check out the large repository with the "svn co" command. We do this only so that we can run "dir /A:D /B", which lists just the directories in the checkout directory. We use that output to loop through each top-level directory in the large repository.

Then, for each directory in the large repository, we create a corresponding small repository, then do our dump/filter/load combo. Again, we're dumping the contents of the large repository, using "svndumpfilter" to filter by directory, then loading that filtered dump into the new small repository.
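For readers who would rather drive the same dump/filter/load combo from Python than from a batch file, here is a minimal sketch. The function names "build_pipeline" and "run_pipeline" are made up for illustration, and the sketch assumes the svnadmin and svndumpfilter binaries are on the PATH and that the small repository has already been created with "svnadmin create".

```python
import subprocess


def build_pipeline(big_repo, directory, small_repo):
    """Return the three argument lists for dump | filter | load."""
    return [
        ["svnadmin", "dump", big_repo],
        ["svndumpfilter", "include", directory],
        ["svnadmin", "load", small_repo],
    ]


def run_pipeline(commands):
    """Chain the commands so each one reads the previous one's stdout."""
    procs = []
    prev_stdout = None
    for index, cmd in enumerate(commands):
        is_last = index == len(commands) - 1
        proc = subprocess.Popen(
            cmd,
            stdin=prev_stdout,
            stdout=None if is_last else subprocess.PIPE,
        )
        if prev_stdout is not None:
            # The child now owns the read end; close our copy so an early
            # exit downstream is noticed upstream.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)
    # One exit code per command, in pipeline order.
    return [proc.wait() for proc in procs]
```

Calling run_pipeline(build_pipeline(r"c:\BigRepo", "MyDirectory", r"c:\SmallRepos\MyDirectory")) would do the work of the piped line in the batch script for one directory. Passing argument lists instead of one command string avoids quoting problems with directory names that contain spaces.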

Finally, we just do some cleanup by removing our temp files and the checkout directory.

There are a few caveats with this code.

First, it will import all of the large repository's revisions into each smaller repository. There are svndumpfilter arguments to prevent this, such as --drop-empty-revs and --renumber-revs, but I found the Windows Subversion binaries to be problematic with these arguments. The end result is that you have more revision numbers than needed, many of them empty, but only the relevant data is actually imported, and the log for the imported directory shows only the revisions that touched it, so there's really little harm done.

Second, the dump/filter/load action doesn't always work on a directory that has been moved (copied/deleted) from another location within the large repository. What's worse, it won't fail; it just won't load any data into the small repository. To address this, use the --revision argument on the "svnadmin dump" command to do a dump starting at a revision after the move took place. Doing so will give the "svndumpfilter" command something it can work with.
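As a rough illustration of that workaround, the sketch below builds the "svnadmin dump" argument list with an optional starting revision. The function name and the revision number 120 are hypothetical. It relies on the documented behavior that, without --incremental, the first revision in a ranged dump is written as a complete copy of the tree at that revision, so the moved directory shows up as plain adds rather than as a copy from a path the filter is excluding.

```python
def dump_command(repo_path, start_rev=None):
    """Build the svnadmin dump argument list.

    start_rev is a hypothetical revision number chosen to be later than
    the revision in which the directory was moved.
    """
    cmd = ["svnadmin", "dump", repo_path]
    if start_rev is not None:
        # LOWER:UPPER range; HEAD means the latest revision.
        cmd += ["--revision", f"{start_rev}:HEAD"]
    return cmd


# Example: the directory was moved somewhere before revision 120.
print(dump_command(r"c:\BigRepo", 120))
```

The trade-off is that history from before the starting revision is not carried into the small repository.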

This process is certainly more complicated to explain, but ultimately there's not that much more going on. Hopefully this explanation is helpful to you.

Combining Repositories Into One Large Repository

I keep all my projects in separate Subversion repositories. I did this because it feels a lot cleaner this way, there is less risk in the event of repository corruption, and I use corresponding Trac projects that I also wanted to keep separate from one project to the next.

That said, there are advantages to having one single repository. No big deal, that can be done after the fact with code.

Here is some Windows code to combine all the repositories in a directory into a single big repository:

DOS:
  set svndir=c:\Test\svn
  set bigrepo=c:\Test\BigRepo
  set bigrepoUNC=file:///c:/Test/BigRepo
  set rev=0:HEAD

  echo Setting up the big repository.
  REM Start from scratch: remove any previous big repository.
  if exist %bigrepo% rmdir /S /Q %bigrepo%
  mkdir %bigrepo%
  svnadmin create %bigrepo%

  REM /d also switches drives if svndir is on a different one.
  cd /d %svndir%
  dir /A:D /B> dirs.tmp
  for /F %%i in (dirs.tmp) do (
      echo Adding %svndir%\%%i to the big repository:
      svnadmin dump -r %rev% %%i > %%i.dmp
      svn mkdir -m "Making project directory %%i." --non-interactive %bigrepoUNC%/%%i
      svnadmin load %bigrepo% --parent-dir %%i < %%i.dmp
      del /F /Q %%i.dmp
  )
  del dirs.tmp

There's really not much happening here; the process is simple. First, we create the new "big" repository with the svnadmin create statement. Second, we loop through the directory, processing each Subversion repository in the directory with a three-step process: (a) Dump the repository with the svnadmin dump statement into a temporary *.dmp file. (b) Explicitly add a new directory in the "big" repository for the current repository we're processing, with the svn mkdir statement. (c) Import the dump into the "big" repository with the svnadmin load statement. Really, the rest of the code is just looping, status messages, or cleanup code.
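The three steps for a single repository can also be sketched in Python. The names "combine_steps" and "run_steps" are invented for this example, and the sketch assumes it is run from the directory that holds the small repositories, with svn and svnadmin on the PATH and the big repository already created.

```python
import os
import subprocess


def combine_steps(repo_name, big_repo, big_repo_url, rev="0:HEAD"):
    """Return the dump, mkdir, and load commands for one small repository."""
    return {
        "dump_file": repo_name + ".dmp",
        # (a) dump the small repository
        "dump": ["svnadmin", "dump", "-r", rev, repo_name],
        # (b) create the project directory in the big repository
        "mkdir": ["svn", "mkdir", "-m",
                  f"Making project directory {repo_name}.",
                  "--non-interactive", f"{big_repo_url}/{repo_name}"],
        # (c) load the dump under that directory
        "load": ["svnadmin", "load", big_repo, "--parent-dir", repo_name],
    }


def run_steps(steps):
    """Run (a), (b), (c) in order, then delete the temporary dump file."""
    with open(steps["dump_file"], "wb") as out:
        subprocess.run(steps["dump"], stdout=out, check=True)
    subprocess.run(steps["mkdir"], check=True)
    with open(steps["dump_file"], "rb") as src:
        subprocess.run(steps["load"], stdin=src, check=True)
    os.remove(steps["dump_file"])
```

Using check=True stops the run at the first failing command, which the batch version does not do.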

What have we produced? As you might expect, we now have one big repository that has all of the files and commits that were in all of the smaller repositories. The big repository will maintain its own revision numbering, so the revision numbers in your smaller repositories will not match the big repository's revision numbering, although the original commit dates will be preserved. This can be really handy for searching or similar actions that you might do from a more global perspective.

However, this approach is not without its caveats. During the import process, one entire repository is imported at a time. All of a particular repository's revisions will be "grouped" together in the big repository. As a result, revision numbers in the big repository will change every time you recreate it, if there was any new activity in the repositories it contains. For instance, revision #1050 in the big repository may parallel revision #500 in Repository X, but if a commit was added to a repository that is imported before it and the big repository is recreated, that revision would now be #1051. Additionally, although all history and dates are preserved in the revisions, the big repository has commits that are not in chronological order since the import was processed by repository. This inconsistent date/commit ordering can be confusing to some repository reporting tools and may actually render those tools useless to you when they are reporting by date.
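The renumbering effect can be worked through with a little arithmetic. The sketch below assumes each project contributes one revision for its "svn mkdir" plus one revision for each of its own revisions 1 through HEAD (revision 0 of a dump does not create a new revision when loaded), and that nothing else is committed to the big repository. The repository names and revision counts are invented to match the example above.

```python
def big_repo_offsets(repos_in_import_order):
    """Map each repository name to an offset such that
    big_revision = offset + small_revision.

    repos_in_import_order is a list of (name, head_revision) pairs.
    """
    offsets = {}
    next_free = 1  # first revision number the big repository will assign
    for name, head in repos_in_import_order:
        # The mkdir commit takes next_free; small revision r then lands
        # at next_free + r.
        offsets[name] = next_free
        next_free += 1 + head
    return offsets


before = big_repo_offsets([("RepoA", 548), ("RepoX", 700)])
after = big_repo_offsets([("RepoA", 549), ("RepoX", 700)])  # one new commit in RepoA
print(before["RepoX"] + 500)  # 1050
print(after["RepoX"] + 500)   # 1051
```

One extra commit in a repository that is imported earlier shifts every later repository's revisions by one.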

Filtering by revision. Note that my svnadmin dump statement includes the -r argument, which specifies the beginning and ending revisions to dump. By default, I'm using "0:HEAD", which basically means "dump every revision", or "dump from the first revision to the HEAD, or latest, revision". Changing the beginning and ending revisions can be useful, especially when used with dates instead of actual revision numbers. For instance, you could change the value to {2007-01-01}:{2007-12-31} to only dump revisions that were committed in 2007.

Combining all of your smaller repositories into one big repository after the fact isn't a perfect solution, but it can be handy, and it's really easy to do when you have a script like this ready to run.

How to Fix 301 Error for Subversion Checkouts

My Linux box was hosting Subversion with no problem. I added a new repository to the several that were already present, and when I checked it out, it said, "301 Moved Permanently". Excuse me?

As it turns out, there is a 301 error section in the Subversion FAQs. It says that this typically means your Apache configuration is invalid (nope, the rest of my repositories worked just fine) or your repository has the same path as a literal directory on your web root. Ahhh!

Sure enough, my Subversion path was http://myserver.com/xyz/, and I had a literal directory named "xyz" in the web root. I changed that directory name, and Subversion then let me check out the repository with no problem.
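If you want to check for this kind of collision before adding a repository, a small sketch like the following may help. The function name is made up, and the document root "/var/www" in the example is only a placeholder for wherever your Apache web root actually lives.

```python
import os


def shadowing_directory(document_root, repo_url_path, isdir=os.path.isdir):
    """Return the on-disk directory that collides with the repository's
    URL path, or None if there is no collision.

    isdir can be swapped out for testing; by default the real file
    system is checked.
    """
    segments = [part for part in repo_url_path.split("/") if part]
    if not segments:
        return None
    candidate = os.path.join(document_root, *segments)
    return candidate if isdir(candidate) else None


# Example with a placeholder web root.
clash = shadowing_directory("/var/www", "/xyz/")
if clash:
    print("Rename or remove", clash, "or pick another repository path.")
```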

Upgrading Subversion Requires a Bindings Update for Trac!

My Subversion/Trac server was at Trac v0.9.6 and Subversion v1.3.x because those were the latest stable releases when I set up the server. I decided it would be relatively quick and painless to at least get the latest version of Subversion (v1.4.5) installed since I didn't see anything on the web about Trac v0.9.6 being incompatible with newer Subversion builds.

Using the Windows binary installer, I had no problem installing Subversion v1.4.5 on the server. I tested everything: Subversion still worked, it showed the new version when accessed via the web, and Trac still worked fine.

Alas: Don't forget that an upgraded version of Subversion will not upgrade your repository. It will upgrade a working copy of a checked-out repository, but it will not upgrade the repository itself.

That said, I was unaware of one more step that you must take to upgrade Subversion on a Subversion/Trac setup: You must also upgrade the Python bindings to Subversion.

This became apparent the next time I created a new repository, which was now a v1.4.x repository. When I tried to build a Trac environment to point to it, Trac got upset with the classic "Expected version '3' of repository; found version '5'" error. To fix this, you must install bindings for the new version of Subversion, as explained in the TracSubversion page.

Now, I obviously love Subversion and I love Trac, but honestly, straightforward documentation for someone who doesn't want to get in the thick of it isn't really the strong suit of these communities, at least when it comes to installation and deployment on the server. It is not easy to work out what "update the Subversion bindings" means, or how to do it. However, the solution is simple: download the "svn-python" Windows installer that matches your versions of Subversion and Python (in my case, 1.4.5 and py2.3) and run it on the server.

In my case, I had to stop Apache for the installation to succeed. Upon restarting Apache, everything worked great.
