Easy data set
Consider first the simpler problem: To increment a number in register A by one. In general such a number has binary representation x01^{n}, where x is an arbitrary string of ones and zeroes and notation d^{k} means k consecutive digits d (so 1^{n} is n ones). Clearly by incrementing this number by one, we get x10^{n}.
In order to perform this operation on our machine we first separate the rightmost (least significant) block of n ones. We do this in the following way:
First we negate the number to get y10^{n} (where y is a bitwise negation of x).
mov B A
not B 
Then we replace y by ones. We can do this by propagating that rightmost one to the left by shling and oring similar to this:
mov C B
shl C 1 or B C 
After this block we have something like y′110^{n} in register B (where y′ is some string of ones and zeroes). Clearly by repeating this block (at least) 64 times we can get 1^{64n}0^{n} in register B. But that will make out program too long. Fortunately we now have (at least) two rightmost ones so we can go faster:
mov C B
shl C 2 or B C 
and we get y′′11110^{n}. Now we have four ones so we can go even faster. Another four blocks will complete the work:
mov C B
shl C 4 or B C mov C B shl C 8 or B C mov C B shl C 16 or B C mov C B shl C 32 or B C 
Now we have 1^{64n}0^{n} in B. That is what we wanted the rightmost ones (in original number) are separated. Some simple operations will finish the incrementation:
Clear out the rightmost ones in A to get x00^{n},
and A B

construct the one on n + 1th place (0^{63n}10^{n}),
mov C B
shl C 1 not C and B C 
and put that one into A.
or A B

And we are done. Almost. Our task was to increment value in A by 8, not 1. However since binary representation of 8 is 1000 the solution is similar. We just shift the number to the right by 3 positions, add 1 and shift it back (while preserving last 3 bits). Sure you can imagine how to do that.
Also note that the above construction of incrementing by one need not work if the result would overflow. However in order to use it to solve our origonal task this is not important since we shift the inpit number to the right (so result of incrementation will not overflow) and then we shift it left (so excess bits will get lost).
Hard data set
First let’s find out how does next permutation of bit sequence look like. Clearly a next permutation of a sequence x011^{m}0^{n} is x100^{n}1^{m}. That is, to get a next permutation we find the rightmost one preceded by a zero, move it one place to the left and push the block of ones right of it far to the right. If there is no one preceded by a zero, then the input is either 0 or its next permutation does not fit in 64bit register.
In order to find the next permutation we separate the blocks of n zeroes and m ones by using the similar method as in solution of easy task:
mov B A
mov C B shl C 1 or B C mov C B shl C 2 or B C mov C B shl C 4 or B C mov C B shl C 8 or B C mov C B shl C 16 or B C mov C B shl C 32 or B C 
Now we have 1^{k}111^{m}0^{n} in B (where k = 62  n  m) and to separate m ones we do
mov C A
not C and C B 
to get y100^{m}0^{n} in C (where y is a bitwise negation of x) and
mov D C
shl D 1 or C D mov D C shl D 2 or C D mov D C shl D 4 or C D mov D C shl D 8 or C D mov D C shl D 16 or C D mov D C shl D 32 or C D 
to get 1^{k}100^{m}0^{n} in C. Now we can clear out the rightmost m ones in A
and A C

and find the one that was originaly the rightmost one preceded by zero and move it one place to the left (it is the rightmost one of C).
mov D C
shl D 1 not D and D C or A D 
Now we have x100^{m+n} in A and 0^{k}100^{n+m} in D. Using the current values on B,C and D we can reconstruct the block of m ones:
shr D 1
or D C not D and D B 
Now we have 0^{k}001^{m}0^{n} in D. All we need is to push them n places to the right. We can do this with the help of value in B. First we negate it to get 0^{64n}1^{n}
not B

and then we will simultaneously shift B and D while there ar som ones in B. This can be done in the following way:
First we check if there is a one in B:
mov E B
and E 1 
Now E is 0 if B has no ones and E is 1 if there are some ones in B. This works because B has form 0^{64n}1^{n}. So we can shift B and D by E.
shr B E
shr D E 
So by these two blocks we can shift D by one place to the right and clear one one from B if there is one. By repeating them (at least) 64 times we can shift D to the right by n places because B originally contained n ones. However this would (probably) make the program too long. Of course we can go faster, for example:
mov E B
shr E 7 and E 1 shl E 3 shr B E shr D E 
This will shift D by 8 places to the right and clear 8 ones from B if there are at least 8 of them. By using 7 blocks similar to above one we can shift D to the right by n places in a more efficient way:
mov E B
shr E 63 and E 1 shl E 6 shr B E shr D E mov E B shr E 31 and E 1 shl E 5 shr B E shr D E mov E B shr E 15 and E 1 shl E 4 shr B E shr D E mov E B shr E 7 and E 1 shl E 3 shr B E shr D E mov E B shr E 3 and E 1 shl E 2 shr B E shr D E mov E B shr E 1 and E 1 shl E 1 shr B E shr D E mov E B shr E 0 and E 1 shl E 0 shr B E shr D E 
And all that remains is ti put those shifted m ones into A
or A D

and we are done.