Separating a Large Repository

A few months ago, I posted an article about combining multiple Subversion repositories into one large repository. Some folks have expressed an interest in doing the opposite--separating one large repository into multiple smaller repositories. The process is not without its quirks, but it can be done.

At first glance, you'd conclude the process would work much the same way: Loop through the individual directories in the large repository, create smaller repositories for each one, then dump and import the contents of each directory into its small repository.

The tricky part is that the Subversion dump command dumps everything in the repository, by revision. In order to pull just a single directory, you must filter a complete dump with the "svndumpfilter" command. This blog post by AllMyBrain.com explains how to accomplish this in Linux. I usually have to work on a Windows box on the job, so I wrote a Windows batch script that does the same thing.

The strategy is the same as the Linux script, though. We're going to "svnadmin dump" the large repository, pipe the dump through "svndumpfilter" to keep just the directory we want, then "svnadmin load" the results into the newly created repository. All of this can be combined into a single statement via piping:

DOS:
  svnadmin dump c:\my\large\repo\ | svndumpfilter include MyDirectory | svnadmin load MySmallRepo\MyDirectory
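
If you would rather drive the same chain from a script than from the command prompt, here is a minimal Python sketch of the dump, filter, load pipeline. The function names (build_pipeline, run_pipeline) are my own illustrative choices, not part of the batch script; only the three Subversion commands come from the statement above.

```python
import subprocess


def build_pipeline(big_repo, directory, small_repo):
    """Return the three stages of dump | filter | load as argument lists."""
    return [
        ["svnadmin", "dump", big_repo],
        ["svndumpfilter", "include", directory],
        ["svnadmin", "load", small_repo],
    ]


def run_pipeline(commands):
    """Chain the commands with pipes, like `a | b | c` at the prompt.

    Needs the Subversion command-line tools on the PATH.
    Returns the exit code of every stage, in order.
    """
    procs = []
    upstream = None
    for i, cmd in enumerate(commands):
        last = i == len(commands) - 1
        proc = subprocess.Popen(
            cmd,
            stdin=upstream,
            stdout=None if last else subprocess.PIPE,
        )
        if upstream is not None:
            upstream.close()  # drop our copy so the pipe closes when a stage exits
        upstream = proc.stdout
        procs.append(proc)
    return [p.wait() for p in procs]


# Same example paths as the statement above; nothing is executed here.
commands = build_pipeline(r"c:\my\large\repo", "MyDirectory", r"MySmallRepo\MyDirectory")
for cmd in commands:
    print(" ".join(cmd))
```

Calling run_pipeline(commands) would actually run the three tools; checking that every returned exit code is 0 is a cheap way to notice a failed stage.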

This will make a little more sense when we look at the full script. Let's just put it out there and then go through it.

DOS:
  REM Where the small repositories go, and where the big one lives.
  SET SmallRepoPath=c:\SmallRepos
  SET PathToRepo=c:\BigRepo
  SET UNCToRepo=file:///c:/BigRepo
  SET PathToChkout=c:\BigRepoChkout

  REM Check out the big repository so we can list its top-level directories.
  mkdir %PathToChkout%
  svn co %UNCToRepo% %PathToChkout% --ignore-externals
  dir /A:D /B %PathToChkout%> %PathToChkout%\dirs.tmp

  REM One new repository per top-level directory: dump, filter, load.
  for /F %%i in (%PathToChkout%\dirs.tmp) do (
      if not %%i==.svn (
          echo Processing "%%i"...
          mkdir %SmallRepoPath%\%%i
          svnadmin create %SmallRepoPath%\%%i
          svnadmin dump %PathToRepo% | svndumpfilter include %%i | svnadmin load %SmallRepoPath%\%%i
      )
  )

  REM Clean up the temp file and the checkout.
  del %PathToChkout%\dirs.tmp
  rmdir /S /Q %PathToChkout%

First, we're setting our paths. "SmallRepoPath" will be the directory holding all of the small repositories we'll be creating. "PathToRepo" and "UNCToRepo" both point to the big repository: the first as a Windows file path for "svnadmin", the second as a file:/// URL for "svn" (despite the variable name, it is a URL rather than a true UNC path). "PathToChkout" points to a Subversion checkout of the large repository.

Next, we check out the large repository with the "svn co" command. We do this just so that we can call the "dir /A:D /B" command, which says, "List just the directories in the checkout directory." We use that output to loop through each directory in the large repository.
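
The directory listing step is easy to picture in isolation. Below is a small Python sketch that does what "dir /A:D /B" plus the ".svn" check do in the batch script: list the immediate subdirectories of a checkout and skip Subversion's own folder. The function name top_level_dirs and the project names are made up for the demonstration, and a temporary directory stands in for the real checkout.

```python
import os
import tempfile


def top_level_dirs(checkout_path):
    """List the immediate subdirectories of a checkout, skipping .svn."""
    return sorted(
        entry.name
        for entry in os.scandir(checkout_path)
        if entry.is_dir() and entry.name != ".svn"
    )


# Throwaway directory standing in for the checkout (hypothetical names).
with tempfile.TemporaryDirectory() as checkout:
    for name in ("ProjectA", "ProjectB", ".svn"):
        os.mkdir(os.path.join(checkout, name))
    open(os.path.join(checkout, "readme.txt"), "w").close()  # files are ignored
    found = top_level_dirs(checkout)

print(found)  # ['ProjectA', 'ProjectB']
```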

Then, for each directory in the large repository, we create a corresponding small repository, then do our dump/filter/load combo. Again, we're dumping the contents of the large repository, using "svndumpfilter" to filter by directory, then loading that filtered dump into the new small repository.

Finally, we just do some cleanup by removing our temp files and the checkout directory.

There are a few caveats with this code.

First, it will import all of the large repository's revisions into each smaller repository. There are svndumpfilter arguments to prevent this, --drop-empty-revs and --renumber-revs, but I found the Windows Subversion binaries to be problematic with these arguments. The end result is that you have more revision numbers than needed, but only the relevant data is actually imported into the repository, and viewing the log for the imported directory still shows only the revisions related to that directory, so there's really little harm done.

Second, the dump/filter/load action doesn't always work on a directory that has been moved (copied/deleted) from another location within the large repository. What's worse, it won't fail; it just won't load any data into the small repository. To address this, use the --revision argument on the "svnadmin dump" command to do a dump starting at a revision after the move took place. Doing so will give the "svndumpfilter" command something it can work with.
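
Both caveats come down to extra arguments on the first two commands of the pipeline. Here is a hedged Python sketch that builds those command lines; the helper names and the revision number 1234 are hypothetical, and the flags are the ones named above (--revision for "svnadmin dump", --drop-empty-revs and --renumber-revs for "svndumpfilter").

```python
def dump_command(repo_path, start_rev=None):
    """Build `svnadmin dump`, optionally starting after a directory move."""
    cmd = ["svnadmin", "dump", repo_path]
    if start_rev is not None:
        # Dump only from start_rev up to the latest revision.
        cmd += ["--revision", f"{start_rev}:HEAD"]
    return cmd


def filter_command(directory, compact=False):
    """Build `svndumpfilter include`, optionally dropping empty revisions."""
    cmd = ["svndumpfilter", "include"]
    if compact:
        # The two flags from the first caveat; test them on your binaries first.
        cmd += ["--drop-empty-revs", "--renumber-revs"]
    return cmd + [directory]


# 1234 is a made-up revision standing in for "just after the move".
print(" ".join(dump_command(r"c:\BigRepo", start_rev=1234)))
print(" ".join(filter_command("MyDirectory", compact=True)))
```

Printing the commands before running them makes it easy to confirm the starting revision really is later than the move.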

This process is certainly more complicated to explain, but ultimately there's not that much more going on. Hopefully this explanation is helpful to you.

Shrink the Unshrinkable SQL Transaction Log

Various things can cause SQL Server to get in a rut and stop emptying a database's transaction log. In my case, our database backups had been failing without our knowledge for several weeks, and the transaction logs of a few databases grew so large that the backup process would no longer clear them out. In one case, we had a 187MB database with a 37GB transaction log!

The insanity had to stop! A handful of databases like this would put us over the top on that particular server's hard drive storage.

The SQL Server GUI for shrinking the database had no effect, and even the DBCC SHRINKFILE command was not working.

The key, as explained by Pinal Dave, is to run the SHRINKFILE command twice, with an explicit backup log truncation between the two runs. This code will get you up and running:

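SQL:
  -- Note: WITH TRUNCATE_ONLY was removed in SQL Server 2008,
  -- so this sequence applies to SQL Server 2005 and earlier.
  DBCC SHRINKFILE("MyDatabase_Log", 1)
  BACKUP LOG MyDatabase WITH TRUNCATE_ONLY
  DBCC SHRINKFILE("MyDatabase_Log", 1)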

This freed up dozens of gigabytes on our server.
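
With a handful of databases in this state, it can be handy to generate the statements rather than type them. The following is only a sketch: the function name, the second database name, and the idea that each log file is called "<database>_Log" are assumptions for illustration, and the USE line is my addition so that SHRINKFILE acts on the intended database.

```python
def shrink_log_statements(database, log_file, target_mb=1):
    """Return the shrink / truncate / shrink sequence for one database.

    `log_file` is the logical name of the log file, which the caller
    is assumed to know ("MyDatabase_Log" in the example above).
    """
    shrink = f'DBCC SHRINKFILE("{log_file}", {target_mb})'
    return [
        f"USE [{database}]",  # assumption: not in the original snippet
        shrink,
        f"BACKUP LOG [{database}] WITH TRUNCATE_ONLY",
        shrink,
    ]


# Hypothetical database names; substitute your own.
for db in ["MyDatabase", "OtherDatabase"]:
    print("\n".join(shrink_log_statements(db, f"{db}_Log")))
```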

Combining Repositories Into One Large Repository

I keep all my projects in separate Subversion repositories. I did this because it feels a lot cleaner this way, there is less risk in the event of repository corruption, and I use corresponding Trac projects that I also wanted to keep separate from one project to the next.

That said, there are advantages to having one single repository. No big deal, that can be done after the fact with code.

Here is some Windows code to combine all the repositories in a directory into a single big repository:

DOS:
  set svndir=c:\Test\svn
  set bigrepo=c:\Test\BigRepo
  set bigrepoUNC=file:///c:/Test/BigRepo
  set rev=0:HEAD

  echo Setting up the big repository.
  rmdir /S /Q %bigrepo%
  mkdir %bigrepo%
  svnadmin create %bigrepo%

  cd %svndir%
  dir /A:D /B> dirs.tmp
  for /F %%i in (dirs.tmp) do (
      echo Adding %svndir%\%%i to the big repository:
      svnadmin dump -r %rev% %%i > %%i.dmp
      svn mkdir -m "Making project directory %%i." --non-interactive %bigrepoUNC%/%%i
      svnadmin load %bigrepo% --parent-dir %%i < %%i.dmp
      del /F /Q %%i.dmp
  )
  del dirs.tmp

There's really not much happening here; the process is simple. First, we create the new "big" repository with the svnadmin create statement. Second, we loop through the directory, processing each Subversion repository in the directory with a three-step process: (a) Dump the repository with the svnadmin dump statement into a temporary *.dmp file. (b) Explicitly add a new directory in the "big" repository for the current repository we're processing, with the svn mkdir statement. (c) Import the dump into the "big" repository with the svnadmin load statement. Really, the rest of the code is just looping, commenting, or cleanup code.

What have we produced? As you might expect, we now have one big repository that has all of the files and commits that were in all of the smaller repositories. The big repository will maintain its own revision numbering, so the revision numbers in your smaller repositories will not match the big repository's revision numbering, although the original commit dates will be preserved. This can be really handy for searching or similar actions that you might do from a more global perspective.

However, this approach is not without its caveats. During the import process, one entire repository is imported at a time. All of a particular repository's revisions will be "grouped" together in the big repository. As a result, revision numbers in the big repository will change every time you recreate it, if there was any new activity in the repositories it contains. For instance, revision #1050 in the big repository may parallel revision #500 in Repository X, but if a commit was added to a repository that is imported before it and the big repository is recreated, that revision would now be #1051. Additionally, although all history and dates are preserved in the revisions, the big repository has commits that are not in chronological order since the import was processed by repository. This inconsistent date/commit ordering can be confusing to some repository reporting tools and may actually render those tools useless to you when they are reporting by date.
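
The renumbering is simple arithmetic, and a short Python sketch makes the #1050 to #1051 shift concrete. It rests on assumptions that match the script but that you should verify for your own setup: the big repository starts empty, each project gets exactly one "svn mkdir" commit, and every revision of a small repository becomes exactly one revision in the big one. The repository names and revision counts are invented.

```python
def big_repo_revision(repo_sizes, repo_name, small_rev):
    """Map a small-repository revision to its number in the big repository.

    repo_sizes: list of (name, revision_count) pairs in import order.
    """
    offset = 0
    for name, count in repo_sizes:
        if name == repo_name:
            # +1 for the `svn mkdir` commit made just before this import
            return offset + 1 + small_rev
        offset += 1 + count  # the mkdir commit plus all imported revisions
    raise KeyError(repo_name)


# RepoA is imported before RepoX (hypothetical sizes).
before = big_repo_revision([("RepoA", 548), ("RepoX", 800)], "RepoX", 500)
# One new commit lands in RepoA and the big repository is rebuilt.
after = big_repo_revision([("RepoA", 549), ("RepoX", 800)], "RepoX", 500)

print(before, after)  # 1050 1051
```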

Filtering by revision. Note that my svnadmin dump statement includes the -r argument, which specifies the beginning and ending revisions to dump. By default, I'm using "0:HEAD", which basically means "dump every revision", or "dump from the first revision to the HEAD, or latest, revision". Changing the beginning and ending revisions can be useful, especially when used with dates instead of actual revision numbers. For instance, you could change the value to {2007-01-01}:{2007-12-31} to only dump revisions that were committed in 2007.
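
As a small illustration, here is a Python sketch that formats a date range in the curly-brace syntax shown above and drops it into the dump command. The helper names and the repository name "MyProject" are placeholders of mine.

```python
def date_range(start, end):
    """Format two ISO dates (YYYY-MM-DD) as a Subversion revision range."""
    return "{%s}:{%s}" % (start, end)


def dated_dump_command(repo_path, rev="0:HEAD"):
    """Build `svnadmin dump -r <range> <repo>`; the default dumps everything."""
    return ["svnadmin", "dump", "-r", rev, repo_path]


rev_2007 = date_range("2007-01-01", "2007-12-31")
print(" ".join(dated_dump_command("MyProject", rev_2007)))
# svnadmin dump -r {2007-01-01}:{2007-12-31} MyProject
```

One thing to keep in mind: Subversion treats a date given without a time as 00:00:00 on that day, so an end bound of {2008-01-01} may be what you want if commits made during December 31 should be included.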

Combining all of your smaller repositories into one big repository after the fact isn't a perfect solution, but it can be handy, and it's really easy to do when you have a script like this ready to run.

Using SQL to Retrieve SQL

At work, someone made a request that required me to look through potentially hundreds of views in dozens of databases on our SQL Server. I certainly didn't want to examine each one at a time. How could I speed up this process with code?

Well, you can find all of your views by querying the sysobjects table, and you can retrieve the SQL behind the views by querying the syscomments table. Something like this works well:

SQL:
  SELECT RTrim(sysobjects.name) AS ViewName,
         RTrim(syscomments.text) AS ViewSQL
  FROM   sysobjects JOIN syscomments
  ON     syscomments.id=sysobjects.id
  WHERE  sysobjects.xtype='V' AND sysobjects.category=0

This will retrieve the SQL code and names of all the views in the current database. This simple query is the heart of the solution. But I would like to retrieve this information for all of the databases.

Well, it's easy enough to get a list of all the databases. The sysdatabases table in the master database has that list. You can query that table, perhaps filtering out some of the system or sample databases included with SQL Server:

SQL:
  SELECT name FROM master.dbo.sysdatabases
  WHERE name NOT IN ('tempdb','master','msdb','pubs','model')
  ORDER BY name

Now just combine this information. To accomplish this, we'll build a stored procedure that will create a temporary table, loop through the databases and query each one for its views, insert the view information into the temporary table, and return the temporary table.

Something like this will do the trick:

SQL:
  CREATE PROC dbo.selectViews AS
  BEGIN

  -- Vars
  DECLARE @dbname sysname

  -- Temp Table
  CREATE TABLE #Results
  (
    DatabaseName varchar(200),
    ViewName varchar(200),
    ViewText nvarchar(4000)
  )

  -- Loop Thru the Databases.
  DECLARE dbnames_cursor CURSOR
  FOR
    SELECT name FROM master.dbo.sysdatabases
    WHERE name NOT IN ('tempdb','master','msdb','pubs','model')
    ORDER BY name
  OPEN dbnames_cursor
  FETCH NEXT FROM dbnames_cursor INTO @dbname
  WHILE (@@FETCH_STATUS <> -1)
  BEGIN
    IF (@@FETCH_STATUS <> -2)
    BEGIN
      -- Grab the Views of this Database and Put them in the Temp Table.
      SET @dbname = RTRIM(@dbname)
      INSERT INTO #Results
      EXECUTE
      (
        'SELECT '''+@dbname+''' as DatabaseName, ' +
        'RTrim(' + @dbname + '.dbo.sysobjects.name) as ViewName, ' +
        'RTrim(' + @dbname + '.dbo.syscomments.text) as ViewText ' +
        'FROM ' + @dbname + '.dbo.sysobjects join ' + @dbname + '.dbo.syscomments ' +
        'ON ' + @dbname + '.dbo.syscomments.id=' + @dbname + '.dbo.sysobjects.id ' +
        'WHERE ' + @dbname + '.dbo.sysobjects.xtype=''V'' and ' + @dbname + '.dbo.sysobjects.category=0 '
      )
    END
    FETCH NEXT FROM dbnames_cursor INTO @dbname
  END
  CLOSE dbnames_cursor
  DEALLOCATE dbnames_cursor
  SELECT * FROM #Results order by DatabaseName, ViewName
  DROP TABLE #Results

  END
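
The dynamic SQL inside the loop is hard to read through all the quote doubling, so here is a Python sketch that assembles the same per-database query as a plain string. It is only meant to show what the EXECUTE call ends up running; the function name view_query and the database name "Sales" are hypothetical.

```python
def view_query(dbname):
    """Build the per-database query that the stored procedure executes."""
    so = f"{dbname}.dbo.sysobjects"
    sc = f"{dbname}.dbo.syscomments"
    return (
        f"SELECT '{dbname}' as DatabaseName, "
        f"RTrim({so}.name) as ViewName, "
        f"RTrim({sc}.text) as ViewText "
        f"FROM {so} join {sc} "
        f"ON {sc}.id={so}.id "
        f"WHERE {so}.xtype='V' and {so}.category=0 "
    )


print(view_query("Sales"))
```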

Now just execute the stored procedure and review its output.

SQL:
  exec selectViews

This portion of the solution just retrieves the data. After writing this, I developed a really short and simple ColdFusion application that would output the database name, view name, and SQL to a table, and used some simple JavaScript to make it easier to search and filter the views. The client-side methods used to view and work with the data are obviously up to you.
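
If you don't need a web front end, the same search can be sketched in a few lines of Python. This assumes the procedure's rows have already been fetched into (database, view, SQL text) tuples by whatever database driver you use; the function name and the sample rows are invented for the example. A long view definition can come back as more than one row, which is why the matches are collected into a set.

```python
def find_views(rows, term):
    """Return sorted (database, view) pairs whose SQL mentions `term`.

    rows: iterable of (database, view, sql_text) tuples.
    The match is case-insensitive.
    """
    needle = term.lower()
    return sorted({(db, view) for db, view, text in rows if needle in text.lower()})


# Made-up rows standing in for the stored procedure's output.
rows = [
    ("Sales", "vOrders", "CREATE VIEW vOrders AS SELECT * FROM Orders"),
    ("Sales", "vCustomers", "CREATE VIEW vCustomers AS SELECT * FROM Customers"),
    ("HR", "vStaffOrders", "CREATE VIEW vStaffOrders AS SELECT * FROM orders o"),
]

print(find_views(rows, "orders"))
# [('HR', 'vStaffOrders'), ('Sales', 'vOrders')]
```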
