
Git commits as a way to find related files

2023-04-22

I've had this code lying around for some time, and although I rarely wake it up to perform its duties, whenever I do my face is distorted by a big grin. I liked the idea when it came to me, and I still like it years later.

Now, if you work like me and aggressively rebase and fix up commits, so that rather large commits end up on your master branch, you may not reap the maximum benefit of this package's recommendations. In truth, working sloppier and committing early and often would yield much better recommendations. Still, you will always get some suggestions even if you're a neat-head like myself.

Revisiting this code with the intent of sharing it, I noticed it relied on projectile, so I've updated the original to use project instead for those who like to stay closer to the builtins. Enough talk, let's get coding.

Oh, and if you don't care about the walkthrough you can have the source straight up!

Code walkthrough

First up some basic requires and faces that we will use when displaying the results:

(require 'cl-lib)
(require 'subr-x)
(require 'project)
(require 'vc-git)

(defface git-related-score
 '((t (:foreground "#f1fa8c")))
 "Face used for git related score."
 :group 'git-related)

(defface git-related-file
 '((t (:foreground "#ff79c6")))
 "Face used for git related file name."
 :group 'git-related)

Then we'll define a variable to hold our graphs, since we will generate one per visited project, as well as some structs to represent a graph, a file and a commit.

(defvar git-related--graphs nil)

(cl-defstruct git-related--graph files commits)

(cl-defstruct git-related--file
 (name "" :type string)
 (commits nil :type list))

(cl-defstruct git-related--commit
 (sha "" :type string)
 (files nil :type list))

Next, a function to instantiate a new graph. It's going to keep track of files in the repository as well as commits, and we set a largish initial hash table size for performance reasons:

(defun git-related--new-graph ()
 "Create an empty graph."
 (make-git-related--graph
  :files (make-hash-table :test 'equal :size 2500)
  :commits (make-hash-table :test 'equal :size 2500)))

Recording a commit then requires a graph (one per project, remember?), a sha representing the commit and every filename referenced by the commit:

(defun git-related--record-commit (graph sha filenames)
 "Record in the GRAPH the relation between SHA and FILENAMES."
 (let ((commit (make-git-related--commit :sha sha)))
  (dolist (filename filenames)
   (let* ((seen-file (gethash filename (git-related--graph-files graph)))
          (file-found (not (null seen-file)))
          (file (or seen-file (make-git-related--file :name filename))))

    (cl-pushnew commit (git-related--file-commits file))
    (cl-pushnew file (git-related--commit-files commit))

    (unless file-found
     (setf (gethash filename (git-related--graph-files graph)) file))))

  (setf (gethash sha (git-related--graph-commits graph)) commit)))

Now we're ready to start constructing the graph from the commits, getting exciting! So we run git log, listing only the names of changed files, and make sure to separate each commit by a null byte to be able to reliably parse the output. The neat part here is that we request everything from git in one operation, which turns out to be pretty efficient.

(defun git-related--replay (&optional graph)
 "Replay git commit history into optional GRAPH."
 (let ((graph (or graph (git-related--new-graph))))
  (with-temp-buffer
   (process-file vc-git-program nil t nil
    "log" "--name-only" "--format=%x00%H")
   (let* ((commits (split-string (buffer-string) "\0" t))
          (replay-count 0)
          (progress-reporter
           (make-progress-reporter "Building commit-file graph..."
            0 (length commits))))
    (dolist (commit commits)
     (let* ((sha-and-paths (split-string commit "\n\n" t
                            (rx whitespace)))
            (sha (car sha-and-paths))
            (paths (when (cadr sha-and-paths)
                    (split-string (cadr sha-and-paths) "\n" t
                     (rx whitespace)))))
      (git-related--record-commit graph sha paths)
      (progress-reporter-update progress-reporter
       (cl-incf replay-count))))
    (progress-reporter-done progress-reporter)))
  graph))
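To make the parsing step easier to follow, here is a minimal standalone sketch that splits a hand-written sample string the same way. The sample is only an assumption about the shape of the git log output (a null byte, the sha, a blank line, then one path per line, with merge commits listing no paths), and example-parse-git-log is a hypothetical helper that is not part of the package:

```elisp
;; Hypothetical helper, for illustration only: it mirrors the splitting
;; done in git-related--replay but returns plain lists.
(defun example-parse-git-log (output)
  "Parse OUTPUT into a list of (SHA PATH...) lists."
  (mapcar
   (lambda (chunk)
     ;; The sha and the file list are separated by a blank line.
     (let ((sha-and-paths (split-string chunk "\n\n" t "[ \t\n]+")))
       (cons (car sha-and-paths)
             (when (cadr sha-and-paths)
               (split-string (cadr sha-and-paths) "\n" t)))))
   ;; Every commit starts with a null byte.
   (split-string output "\0" t)))

;; Assumed sample output: two regular commits around a commit without files.
(example-parse-git-log
 "\0aaa111\n\nlisp/foo.el\nlisp/bar.el\n\0bbb222\n\0ccc333\n\nREADME.md\n")
;; => (("aaa111" "lisp/foo.el" "lisp/bar.el") ("bbb222") ("ccc333" "README.md"))
```

Because a null byte cannot appear in a sha or in a path that git prints this way, splitting on it first keeps each commit's chunk intact no matter how many files it touches.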

With the graph constructed we are ready to find similar files. I'm not going to get into the math here, as I suck at math and am even worse at explaining it, but in essence this calculates cosine similarity using our graph and returns a list of (score filename) pairs sorted by descending similarity score.

(defun git-related--similar-files (graph filename)
 "Return files in GRAPH that are similar to FILENAME."
 (unless (git-related--graph-p graph)
  (user-error "You need to index this project first"))
 (let ((file (gethash filename (git-related--graph-files graph))))
  (when file
   (let ((file-sqrt (sqrt (length (git-related--file-commits file))))
         (neighbor-sqrts (make-hash-table :test 'equal :size 100))
         (hits (make-hash-table :test 'equal :size 100)))

    (dolist (commit (git-related--file-commits file))
     (dolist (neighbor (remove file (git-related--commit-files commit)))
      (let ((count (cl-incf
                    (gethash (git-related--file-name neighbor) hits 0))))
       (when (= count 1)
        (setf (gethash (git-related--file-name neighbor) neighbor-sqrts)
         (sqrt (length (git-related--file-commits neighbor))))))))

    (let (ranked-neighbors)
     (maphash
      (lambda (neighbor-name neighbor-sqrt)
       (let ((axb (* file-sqrt neighbor-sqrt))
             (n (gethash neighbor-name hits)))
        (push
         (list (if (cl-plusp axb) (/ n axb) 0.0) neighbor-name)
         ranked-neighbors)))
      neighbor-sqrts)
     (cl-sort
      (cl-remove-if-not #'git-related--file-exists-p
       ranked-neighbors :key #'cadr)
      #'> :key #'car))))))
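For the curious, the score computed above boils down to the number of commits two files share, divided by the product of the square roots of each file's own commit count. A small standalone sketch of that calculation follows; example-commit-cosine is a hypothetical helper that works on plain lists of shas rather than on the graph:

```elisp
(require 'cl-lib)

;; Hypothetical helper, for illustration only: the same ratio as in
;; git-related--similar-files, computed from two lists of commit shas.
(defun example-commit-cosine (commits-a commits-b)
  "Return the cosine similarity of COMMITS-A and COMMITS-B."
  (let ((shared (length (cl-intersection commits-a commits-b :test #'equal)))
        (norm (* (sqrt (length commits-a)) (sqrt (length commits-b)))))
    ;; Guard against division by zero when a file has no commits.
    (if (cl-plusp norm) (/ shared norm) 0.0)))

;; Two files that always change together score 1.0.
(example-commit-cosine '("a" "b" "c" "d") '("a" "b" "c" "d"))

;; A file touched in only half of the other file's commits scores about 0.71.
(example-commit-cosine '("a" "b" "c" "d") '("a" "b"))
```

A file that shows up in many unrelated commits has a large commit count of its own, which pushes the denominator up and its score down, so files that change everywhere don't crowd out the truly related ones.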

Since we're only interested in suggesting files that are still in the project and can be visited, we need to check the generated suggestions against the current state of the project, as is done above with git-related--file-exists-p:

(defun git-related--file-exists-p (relative-filename)
 "Determine if RELATIVE-FILENAME currently exists."
 (file-exists-p
  (expand-file-name relative-filename
   (project-root (project-current)))))

Now it's just downhill from here. We're going to define a function to prettify the result for display using the faces we defined earlier:

(defun git-related--propertize (hit)
 "Return a rendered representation of HIT for completion."
 (propertize
  (concat
   (propertize (format "%2.2f" (car hit)) 'face 'git-related-score)
   " ---> "
   (propertize (cadr hit) 'face 'git-related-file))
  'path (cadr hit)))

Then a convenient function to update the graph of a project by calling replay:

(defun git-related-update ()
 "Update graph for the current project."
 (interactive)
 (let* ((default-directory (project-root (project-current)))
        (project-symbol (intern (project-name (project-current))))
        (graph (cl-getf git-related--graphs project-symbol)))
  (setf (cl-getf git-related--graphs project-symbol)
   (git-related--replay graph))))

Finally, to navigate by commit similarity we define git-related-find-file like this:

(defun git-related-find-file ()
 "Find files related through commit history."
 (interactive)
 (if (buffer-file-name)
  (let* ((default-directory (project-root (project-current)))
         (graph (cl-getf git-related--graphs
                 (intern (project-name (project-current)))))
         (candidates
          (mapcar #'git-related--propertize
           (git-related--similar-files graph
            (file-relative-name (buffer-file-name) default-directory))))
         (selection (completing-read "Related files: " candidates nil t))
         ;; Look the selection up among the candidates, since
         ;; completing-read may return it without the `path' property.
         (match (car (member selection candidates))))
   ;; Visit the file once; find-file expects a file name, not a buffer.
   (when match
    (find-file (get-text-property 0 'path match))))
  (message "Current buffer has no file")))

Usage and example

So the usage is now like this. In a project (a project being a git repository) call git-related-update once (or whenever you feel the need to rebuild the recommendations). Then, when visiting a file in the project, call git-related-find-file and navigate by commit similarity. In essence, if you're working on, say, a Ruby project with a Gemfile and you look for similar files, your first hit will be Gemfile.lock because they will (should) share all commits.
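If you end up calling these commands often, a couple of key bindings in your init file may help. This is only a sketch: the keys below are arbitrary choices of mine, not something the package sets up, and it assumes the code from the walkthrough has already been loaded:

```elisp
;; Assumed bindings, pick whatever keys suit you.
;; The commands themselves are the ones defined in the walkthrough.
(global-set-key (kbd "C-c r") #'git-related-find-file)
(global-set-key (kbd "C-c u") #'git-related-update)
```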

Just running git-related-update on the magit repository and then calling git-related-find-file on magit-push.el will suggest:

magit-example.png
