First, some Emacs regexp basics:

1. If A and B match single characters, then A\|B should be written [AB] whenever possible. The reason is that A\|B adds a backtrack record which uses stack space and wastes time if matching fails later on. The cost can be quite noticeable, which we have seen.

2. Syntax-class constructs are usually better written as character alternatives when possible.

The \sX construct, for some X, is typically somewhat slower to match than explicitly listing the characters to match. For example, if all you care about are space and tab, then "\\s *" should be written "[ \t]*".

3. Unicode character classes are slower to match than ASCII-only ones. For example, [[:alpha:]] is slower than [A-Za-z], assuming only those characters are of interest.

4. [^...] will match \n unless included in the set. For example, "[^a]\\|$" will almost never match the $ (end-of-line) branch, because a newline will be matched by the first branch. The only exception is at the very end of the buffer if it is not newline-terminated, but that is rarely worth considering for source code.

5. \r (carriage return) normally doesn't appear in buffers even if the file uses DOS line endings. Line endings are converted into a single \n (newline) when the buffer is read. In particular, $ does NOT match at \r, only before \n.

When \r appears it is usually because the file contains a mixture of line-ending styles, typically from being edited using broken tools. Whether you want to take such files into account is a matter of judgement; most modes don't bother.

6. Capturing groups costs more than non-capturing groups, but you already know that.

On to specifics: here are annotations for possible improvements in cc-langs.el. (I didn't bother about capturing groups here.)