Eli Zaretskii:

> Why this doesn't worry you, and why you still refuse to accept that
> maybe, just maybe, this is a lot of effort for a relatively small
> gain, is beyond me.  If this is in any way indicative of the other
> problematic issues of the conversion, then "Houston, we have a
> problem", indeed.

What I refuse to accept is doing a job that is below my standards of
quality, if I'm going to do it at all.  You cannot argue me out of that
by telling me it's too much work, because I simply don't accept that as
a valid reason to settle for slipshod results.  Instead, I upgrade my
tools.

> I found that at least these ones are missing:
>
> lisp/ChangeLog.15 references 103083
> lisp/ChangeLog.16 references 103471 and 107149
> src/ChangeLog.12 references 104015 and 103913

Thank you for finding these.  This is a useful bug report.

To illustrate my methods: I fixed this by adding those revnos to the
ChangeLog section of the map file I enclosed in my last mail (it is the
file FOSSILS in my conversion directory).  Then I ran a Python script
called 'decorate.py' that patched in the corresponding action stamps.
The point is that I didn't have to do the lookup by hand; the fixup took
less time to do than to describe.

The map that decorate.py uses is in turn generated by a second script,
bzrlog2map, which filters the output of 'bzr log --levels 0' into an
association between revnos and action stamps.
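In outline, a filter of that shape might look like the following (a
sketch, not the actual bzrlog2map; the function name and details are my
own reconstruction):

```python
# Sketch of a bzrlog2map-style filter (not the actual script): read
# "bzr log --levels 0" output and emit "revno action-stamp" pairs.
import re
from datetime import datetime, timezone

def log_to_map(log_text):
    """Yield (revno, action-stamp) pairs from bzr log output."""
    revno = committer = None
    for line in log_text.splitlines():
        if line.startswith("revno:"):
            revno = line.split()[1]
        elif line.startswith("committer:"):
            m = re.search(r"<([^>]+)>", line)
            committer = m.group(1) if m else line.split(None, 1)[1]
        elif line.startswith("timestamp:"):
            # e.g. "timestamp: Mon 2014-01-20 08:55:28 -0800"
            when = datetime.strptime(line.split(None, 1)[1],
                                     "%a %Y-%m-%d %H:%M:%S %z")
            stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
            yield revno, "%s!%s" % (stamp, committer)

sample = """\
revno: 116082
committer: Paul Eggert <eggert@cs.ucla.edu>
timestamp: Mon 2014-01-20 08:55:28 -0800
"""
for revno, stamp in log_to_map(sample):
    print(revno, stamp)  # -> 116082 2014-01-20T16:55:28Z!eggert@cs.ucla.edu
```

Note that the local-time timestamps bzr records get normalized to UTC,
which is what makes action stamps comparable across committers.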
Here are the first few lines:

116082 2014-01-20T16:55:28Z!eggert@cs.ucla.edu
116081 2014-01-20T16:47:41Z!eggert@cs.ucla.edu
116080 2014-01-20T08:52:44Z!juri@jurta.org
116079 2014-01-20T08:45:56Z!juri@jurta.org
116078 2014-01-20T08:15:16Z!eggert@cs.ucla.edu
116077 2014-01-20T07:56:28Z!eggert@cs.ucla.edu
116076 2014-01-20T01:21:18Z!rgm@gnu.org
116075 2014-01-20T00:54:19Z!rgm@gnu.org
116074 2014-01-19T16:59:51Z!rudalics@gmx.at
116073 2014-01-19T15:42:48Z!eliz@gnu.org
116072 2014-01-19T13:28:16Z!handa@gnu.org
115426.2.11 2014-01-19T13:27:34Z!handa@gnu.org
115426.2.10 2014-01-19T13:26:21Z!handa@gnu.org
115426.2.9 2014-01-19T12:42:37Z!handa@gnu.org
115426.2.8 2014-01-18T00:24:14Z!handa@gnu.org
115426.2.7 2014-01-18T00:24:03Z!handa@gnu.org

This is MAP in my conversion directory; I rebuild it occasionally to be
sure new revs are included.

The point of having two maps rather than one is this: at some point I'm
going to mechanically compile FOSSILS into a list of reposurgeon
commands.  For example, this:

ChangeLog: revno 108687 -> 2012-06-22T21:17:42Z!eggert@cs.ucla.edu

will become something like this:

=B & [ChangeLog] filter --replace /\brevno 108687\b/2012-06-22T21:17:42Z!eggert@cs.ucla.edu/

That command translates into English as: over the set of all blobs in
the history with paths containing the string 'ChangeLog', replace 'revno
108687' (preceded and followed by breaking characters) with its
corresponding action stamp.

I could, in theory, generate a humongous and guaranteed-exhaustive set
of these commands directly from MAP.  If I did that, though, the
conversion-day script might take many hours to run, most of that time
spent on generated commands that are no-ops.  There could also be
unhappiness related to revision numbers short enough to false-match
numeric tokens that are nothing of the kind.

Instead, FOSSILS both drives and documents the minimum set of changes
required.  The cost is that I have to maintain the list of source tokens
to be replaced partly by hand.
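The mechanical compile step is simple enough to sketch (this is my
illustration, not the actual tool; the FOSSILS line syntax is inferred
from the single example above):

```python
# Sketch of the mechanical FOSSILS-to-reposurgeon compile step: turn
# "path: token -> stamp" lines into reposurgeon filter commands.
import re

FOSSIL_LINE = re.compile(r"^(\S+):\s+(.+?)\s+->\s+(\S+)$")

def compile_fossils(text):
    """Turn 'path: token -> stamp' lines into reposurgeon filter commands."""
    commands = []
    for line in text.splitlines():
        m = FOSSIL_LINE.match(line)
        if m:
            path, token, stamp = m.groups()
            commands.append(
                r"=B & [%s] filter --replace /\b%s\b/%s/" % (path, token, stamp))
    return commands

print(compile_fossils(
    "ChangeLog: revno 108687 -> 2012-06-22T21:17:42Z!eggert@cs.ucla.edu")[0])
# -> =B & [ChangeLog] filter --replace /\brevno 108687\b/2012-06-22T21:17:42Z!eggert@cs.ucla.edu/
```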
This is normal and acceptable; I often deal with similar issues in
Subversion repositories.

> It sounds like the scripts or methods you are using to find such
> references are not catching some of them.  E.g., bare numbers, without
> any leading "r" or "revno:" etc. are mostly (or maybe completely)
> missing.

Looking at bzrlog2map, I see you're right.  One of my to-do items was to
add to it a scanner that would turn up likely reference-string
candidates.  I forgot I hadn't actually done that yet.

> Given this quality, I once again question the need for all this work.

That is incoherent.  Whether the work is needed has *nothing* to do with
whether it is well implemented yet.

> If we cannot guarantee coverage very close to 100%, what would be the
> value of such a partial conversion?

Exactly proportional to the coverage, of course.  Every single reference
that can easily be chased by human eyeball or indexing tool (i.e., that
is *not* a cookie rendered meaningless because its context is gone)
increases the utility of the conversion.  Complete transparency of
reference is best; more is better than less; partial is better than
none.  The history is too messy for us to get 100% coverage (too many
external CVS references), but that is not an argument that we should
settle for zero.

> More importantly, do we have reasonably effective methods of QA for
> the results?  The omissions I discovered are based on simple bzr
> commands followed by manual inspection (to avoid quite a few false
> positives); unless we can come up with better ways that don't involve
> manual labor, the overall quality will not be high enough, as manual
> labor is inherently error prone.

This is why I explained my workflow.  Once a reference has been
identified and put in FOSSILS, none of the remaining steps are
vulnerable to human error.  (My scripts could have bugs, of course.  But
they're not very complex, so we can have reasonably high confidence in
them.)

> Btw, what about references to repositories of other projects?
> Here's one example (from trunk):
>
>     revno: 110764.1.388
>     committer: Bastien Guerry
>     branch nick: emacs-24
>     timestamp: Tue 2013-01-08 19:49:37 +0100
>     message:
>       Merge Org up to commit 4cac75153.  Some ChangeLog formatting fixes.
>
> Are we going to replace the git sha1 here by something more universal?

No, because there is no notation and no resolution protocol for such
references.  If there were such a thing, I would be right on top of
using it.  Actually, if there were such a thing, it would more than
likely have been my invention to begin with...

> If so, there's much more work around the corner; if not, why does it
> make sense to insist on doing that for Emacs's own branches?

Because that *can* be done, and every successful internalization adds
utility by (a) removing an impediment to browsing, and (b) documenting a
causal link.

> See above: this is just the tip of the iceberg.  I think you will find
> much more of such references, with Org, CEDET, MH-E, and Gnus being
> the most frequent ones.  Doesn't leaving those out of this conversion
> undermine the goal?

Yes, of course it does.  Don't let the unachievable perfect be the enemy
of the achievable good!

(Damn, now you've started me thinking about prefixing action stamps with
name lookups to a registry of repositories.  If I invent a practical
solution to this, it's going to be partly your fault...)

> I thought a "changeset" was well defined in the context of a VCS.

In modern VCSes like Bazaar, hg, and git, yes, it is a well-defined
concept.  This conversion creates confusing cases for two reasons.  One
is the vagaries of CVS; the other is ChangeLog entries, which carry some
of the semantic freight of VCS changesets without having the atomicity
and time-locality properties that changesets automatically have when the
VCS actually implements them.  The result is that one Emacs/Zaretskii
"changeset" usually corresponds to one modern VCS changeset, but not
always.
When the correspondence breaks down, one Emacs/Zaretskii "changeset"
maps to two or more VCS changesets, one of which is likely to be a
ChangeLog entry that is semantically bound to the others but is a
singleton changeset that the VCS doesn't know is connected to them.

> My definition is a set of changes made as part of working on a single
> isolated issue.  IOW, what would have constituted a single indivisible
> commit with our current procedures.

The Bazaar portion of the history isn't the problem; the CVS part is.
There are many instances in the CVS part of the Emacs history that look
something like this:

1. Eli changes file A and commits it.
2. Eli changes file B and commits it with an identical change comment.
3. Eric changes file C and commits it.
4. Eli commits a ChangeLog entry describing the A and B changes.
5. Eric commits a ChangeLog entry describing the C changes.

In your terms, there are two changesets here: {1,2,4} and {3,5}.  But
when parsecvs runs, the result will probably look like this:

Changeset 1 - {1,2}
Changeset 2 - {3}
Changeset 3 - {4}
Changeset 4 - {5}

Changesets 1 and 3 don't get joined because the intervening commit
prevented parsecvs from recognizing that they should be coalesced.
(Actually, the behavior is a little better than this: parsecvs did
coalescence by branch, so if commit 3 is on a different branch than 1
and 2, the right thing will happen.)

Here's where the vagaries of CVS come in.  For various stupid random
CVS-is-brain-damaged reasons, there may have been enough skew between
the recorded commit times of 1 and 2 that *they* don't get coalesced,
even though that's what notional-Eli intended.  *That* kind of defect
(eligible commits that didn't fit inside too small a time window) is
what reposurgeon was originally designed to fix.  These are very, *very*
common in crappy CVS lifts, and reposurgeon can fix them automatically.

There is another case common in the Emacs history that can be coalesced.
That is: a file modification immediately followed by a ChangeLog change
describing it - but with an empty change comment on the ChangeLog
change, which parsecvs refuses to consider matching to anything else.
These do have to be fixed up by hand.  I haven't tried yet.

> From a cursory look I had at the current git mirror, no coalescing was
> done there.  But perhaps I'm missing something; Andreas, can you
> please comment on this?

Look for commits that predate the Bazaar transition but change multiple
files.  You'll find parsecvs made those.

> Can we take a real-life use case, please?  Please show the cliques
> produced by your analysis in this range of bzr revisions on the trunk:
> 39997..40058.  You can see the details with these bzr commands:
>
> . This will show a 1-line summary for every revision in the range:
>
>     bzr log --line -r39997..40058
>
> . This will show the full commit messages and other meta-data of a
>   single revision, 40000 in the example (can also be used with a
>   range -rNNN..MMM):
>
>     bzr log --long --show-ids -c40000
>
> . This will show the files modified/added/deleted by a single
>   revision (can also be used with a range -rNNN..MMM):
>
>     bzr status -c40000
>
> The above range of revisions shows a typical routine of commits when
> Emacs was using CVS; in particular, "*** empty log message ***" are
> most probably ChangeLog commits which usually followed commits of the
> files whose log entries are in the ChangeLog change.  Note that the
> commit messages are almost always different (they are actually the
> ChangeLog entries for the files being committed), although the changes
> belong to the same changeset.  Also note how commits by different
> people working on separate changesets sometimes overlap, as in
> revisions 40033..40038.
>
> How will these be handled during your proposed conversion?  And what
> will be the commit messages of the coalesced commits?

I think the example I showed above explains most of this.
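For concreteness, the time-window coalescence rule from the Eli/Eric
example can be sketched like this (my illustration, not reposurgeon's
actual code; the 300-second window is an assumed tolerance, not
reposurgeon's real default):

```python
# Illustrative sketch of time-window coalescence: adjacent per-file CVS
# commits with the same author and change comment merge into one
# changeset if their timestamps fall within a tolerance window.
from collections import namedtuple

Commit = namedtuple("Commit", "author comment when files")

def coalesce(commits, window=300):
    """Group a list of per-file commits into changeset cliques."""
    cliques = []
    for c in sorted(commits, key=lambda c: c.when):
        last = cliques[-1][-1] if cliques else None
        if (last and c.author == last.author and c.comment == last.comment
                and c.when - last.when <= window):
            cliques[-1].append(c)      # same clique: coalesce
        else:
            cliques.append([c])        # new clique
    return cliques

# Eli's A and B commits coalesce despite CVS clock skew; Eric's C,
# carrying a different author and comment, starts a new clique.
cliques = coalesce([
    Commit("eli",  "Fix foo.", 1000, ["A"]),
    Commit("eli",  "Fix foo.", 1047, ["B"]),   # 47s of skew, inside window
    Commit("eric", "Fix bar.", 1100, ["C"]),
])
print([[c.files[0] for c in q] for q in cliques])  # -> [['A', 'B'], ['C']]
```

Note that this rule, like parsecvs's, cannot join cliques separated by
an intervening commit from someone else; that is exactly the case that
needs surgery.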
I'd have to grovel through all the timestamps to find out whether
automatic coalescence would catch any of the cliques in your span, but I
can say that (for example) this:

40050: Miles Bader 2001-10-19 *** empty log message ***
40049: Miles Bader 2001-10-19 Exit if we can't find some variable.

looks like something the "lint" command in reposurgeon would catch.  I
would then eyeball it to check that 40050 is the ChangeLog tweak
describing 40049, and write something like this into the lift script:

<40049>..<40050> squash --pushback

The effect would be to merge 40050's ChangeLog fileop into 40049, which
would keep its comment.  The children and parents of the squashed commit
would be what you think.  And yes, <40049> would be a legal commit
reference in reposurgeon.  Provided I did this first:

read fossils

> > In a properly done conversion, file ignores don't abruptly stop
> > working because you browsed back past the point of conversion and
> > what should be .gitignore files are now .bzrignores or .cvsignores.
>
> So you will be adding .gitignore to revisions where there was none?
> If not, how do you plan on attacking this issue?

By converting .bzrignore files in place to .gitignores.

> If you really want to build confidence in your methods and tools, some
> kind of statistics about the conversion jobs done using them, and the
> time passed since the conversion would probably be a good start.

I can tell you the most important statistics.  For three years of doing
conversions on projects including GPSD, NUT, Hercules, Roundup, Battle
For Wesnoth, robotfindskitten, groff, and several others, I can tell you
three numbers:

1. Time passed since conversion: tops out at 3 years for GPSD, about 2
   years each for NUT and Hercules.

2. Number of defects I found myself after delivering a final conversion:
   three.  (All in Battle For Wesnoth.  Two CVS usernames didn't get
   properly mapped to git-style IDs because the attribution file I was
   using at conversion time was incomplete.)

3.
Number of defects subsequently reported by project dev groups: zero.
Yes, *zero*.

One of the dev groups (Roundup, for which I did SVN->git) later moved to
hg for political reasons.  Otherwise those repositories are still in
active use by multiple developers, and have been for a cumulative
hundreds of thousands of hours.

I won't represent that I think none of my finished conversions has ever
had an error; that would be highly unlikely.  What is true is that any
errors they had were so minor that nobody has thought it worth bugging
me about them.

As a matter of history, GPSD and Hercules were early test conversions.
NUT (Network UPS Tools) was reposurgeon's trial by fire; I went into
that with a usable beta-grade tool and came out of it with something
good enough that the much bigger and nastier Blender conversion could be
done by *people who weren't me*.  By the time I did groff, late last
year, my tools and procedures for normal cases were pretty well
routinized and bulletproofed.  You can read about them here:

DVCS Migration HOWTO: http://www.catb.org/esr/dvcs-migration-guide.html

There's even a makefile that semi-automates the conversion steps.

That said, Emacs is a bit abnormal.  The kind of case I'm used to
handling is a Subversion repo with a fossil layer of CVS, having on the
close order of a decade of history and a commit count in the 3K-30K
range (this describes GPSD, NUT, Hercules, Roundup, and BfW).  The Emacs
history is significantly longer and a bit cruftier than these, and I've
never dealt with a layer of Bazaar before.  These differences do
complicate things a bit (I don't normally have to write custom scripts),
but not unmanageably so.

> (Yes, time since conversion is important because the problems are
> usually subtle and don't stick out until much later.)  Detailed
> description of the planned steps during the conversion, and how you
> intend to control the quality of each step, will also be appreciated.

I'm enclosing a current copy of the lift script.
I'll add more steps as I verify them.

As for how I intend to QA them: my strategy has two prongs.  One is
automating everything I can, so that I have conditional guarantees of
the form "if tool X is correct, then my results are correct".  The
other: historically, I've usually worked in collaboration with a
Mr. Inside, a senior project dev who checked my work in progress from a
position of intimate knowledge of the project history.

Congratulations, I think you've elected yourself for that job.  The
reposurgeon manual is here:

http://www.catb.org/~esr/reposurgeon/reposurgeon.html

> This is great, but doesn't really address the worrisome aspects of the
> conversion we care about.  We no longer care about the elpa branch in
> the bzr repository.  We do care about the few other branches, such as
> emacs-24.  And it is not even clear what will become of those after
> the conversion; the reposurgeon man page cites a limitation related to
> that, allegedly stemming from some (imaginary) bzr confusion between
> branches and repositories, but ends up saying nothing about the
> branches after the conversion.  Will they end up in a single git
> repository, like any other git branches, or won't they?  Will the
> merges between those branches show up as expected in the git DAG?  How
> will merges from external branches (such as Org or MH-E) or from local
> feature branches be represented?  Those are much more important issues
> than the ability to split elpa.

You get to tell me what you want to have happen, Mr. Inside.  If
reposurgeon isn't powerful enough to do it, I'll up-gun it until it is.

Preliminary answer: the git repo after conversion day will, globally
speaking, have the same DAG that the git mirror did before.  Changes
will be localized, and will consist of (a) commit-clique squashes and
(b) a few junk-branch deletions.

Bazaar's very real branch/repo confusion is probably not relevant,
because my conversion procedure never deals with the Bazaar repository
directly.
I start from Andreas's git mirror, which is (presumably) replicating the
branch structure of the entire Bazaar repo every 15 minutes.  If that
isn't true, we have some additional problems to solve that have nothing
to do with my tools.
--
Eric S. Raymond