
Searching code with Sourcegraph

August 17, 2020

This article was contributed by Ben Hoyt

Sourcegraph is a tool for searching and navigating around large code bases. The tool has various search methods, including regular-expression search, and "structural search", which is a relatively new technique that is language-aware. The open-source core of the tool comes with code search, go-to-definition and other "code intelligence" features, which provide ways for developers to make sense of multi-repository code bases. Sourcegraph's code-searching tools can show documentation for functions and methods on mouse hover and allow developers to quickly jump to definitions or to find all references to a particular identifier.

The Sourcegraph server is mostly written in Go, with the core released under the Apache License 2.0; various "enterprise" extensions are available under a proprietary license. The company behind Sourcegraph releases a new version of the tool every month, with the latest release (3.18) improving C++ support and the 3.17 release featuring faster and more accurate code search as well as support for AND and OR search operators.

Code search

The primary feature of Sourcegraph is the ability to search code across one or more repositories. Results usually come back in a second or two, even when searching hundreds of repositories. The default query style is literal search, which will match a search string like "foo bar" exactly, including the quotes. Clicking the .* icon on the right-hand side of the search bar switches to regular-expression search, and either of those search modes supports case-sensitive matching (by clicking the Aa icon).
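The difference between the two modes can be illustrated with a minimal Python sketch. It is not Sourcegraph's implementation; the function name, its parameters, and the sample lines are made up for illustration, and Python's re module stands in for the real matcher.

```python
import re

def find_matches(lines, query, use_regex=False, case_sensitive=False):
    """Return (line_number, line) pairs whose text matches the query."""
    # In literal mode every character of the query must match as-is,
    # so regular-expression metacharacters are escaped first.
    pattern = query if use_regex else re.escape(query)
    flags = 0 if case_sensitive else re.IGNORECASE
    compiled = re.compile(pattern, flags)
    return [(n, line) for n, line in enumerate(lines, start=1)
            if compiled.search(line)]

# Hypothetical sample input, one entry per source line.
SAMPLE = [
    'log.Printf("foo bar")',
    'x := foo.bar()',
    'y := fooXbar',
    'z := FOO.BAR',
]

literal = find_matches(SAMPLE, "foo.bar")
regex = find_matches(SAMPLE, "foo.bar", use_regex=True)
exact_case = find_matches(SAMPLE, "foo.bar", case_sensitive=True)
```

In literal mode the "." only matches a dot, so lines 2 and 4 are found; in regular-expression mode it matches any character, so all four lines are found; with case-sensitive literal matching only line 2 remains.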

[Sourcegraph search]

The [] icon switches to "structural search", a search syntax created by Rijnard van Tonder (who works at Sourcegraph) for his Comby project. Structural searches are language-aware, and handle nested expressions and multi-line statements better than regular expressions. Structural search queries are often used to find potential bugs or code simplifications; an example is a query for the following:

    fmt.Sprintf(":[str]")

That will find places where a developer can eliminate a fmt.Sprintf() call when it has a single argument that is just a string literal.
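To see why balanced delimiters matter, here is a toy Python sketch that is not Comby's algorithm. The helper names and the Go snippet in CODE are hypothetical. It extracts the arguments of each fmt.Sprintf() call by tracking parenthesis depth and double-quoted strings, and compares the result with a naive non-greedy regular expression.

```python
import re

def balanced_span(text, open_index):
    """Return the index just past the ")" that closes text[open_index].

    Double-quoted string literals are skipped, so a ")" inside a string
    does not end the match. Returns -1 if no closing ")" is found.
    """
    depth = 0
    in_string = False
    i = open_index
    while i < len(text):
        ch = text[i]
        if in_string:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return -1

def call_arguments(source, func_name):
    """Yield the full argument text of every call to func_name."""
    for m in re.finditer(re.escape(func_name) + r"\(", source):
        end = balanced_span(source, m.end() - 1)
        if end != -1:
            yield source[m.end():end - 1]

# Hypothetical Go source used as input.
CODE = ('msg := fmt.Sprintf("%s (%d)", name(user), len(items))\n'
        'label := fmt.Sprintf("done")')

structural = list(call_arguments(CODE, "fmt.Sprintf"))
naive = re.findall(r"fmt\.Sprintf\((.*?)\)", CODE)

# Calls whose only argument is a single string literal.
redundant = [a for a in structural
             if re.fullmatch(r'"(?:[^"\\]|\\.)*"', a)]
```

The naive pattern stops at the first ")" it sees, which here sits inside a string literal, while the depth-tracking version returns the complete argument list of the first call. Only the second call ends up in redundant.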

The documentation has an architecture diagram that shows the various processes a Sourcegraph installation runs. There is also a more detailed description of the "life of a search query". The front end starts by looking for a repo: filter in the query to decide which repositories need to be searched. The server stores its list of repositories in a PostgreSQL database, along with most other Sourcegraph metadata; Git repositories are cloned and stored in the filesystem normally.

Next, the server determines which repositories are indexed (for a specific revision, if one is specified in the search query) and which are not. Both indexing the repositories and indexed searches are handled by zoekt, which is a trigram-based code-search library written in Go. (Those curious about using trigrams for searching code may be interested in Go technical lead Russ Cox's article about it.)
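The basic idea behind trigram indexing can be sketched in a few lines of Python. This is a deliberately simplified illustration, not zoekt's actual data structures; the function names and the FILES contents are invented for the example.

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of all three-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(files):
    """Map each trigram to the set of file names that contain it."""
    index = defaultdict(set)
    for name, content in files.items():
        for gram in trigrams(content):
            index[gram].add(name)
    return index

def candidate_files(index, query):
    """Return files that contain every trigram of the query."""
    grams = trigrams(query)
    if not grams:
        raise ValueError("query must be at least three characters long")
    return set.intersection(*(index.get(g, set()) for g in grams))

def search(index, files, query):
    """Return the sorted names of files that contain the literal query."""
    # A file can contain all of the trigrams without containing the
    # query itself, so each candidate is confirmed with a real scan.
    return sorted(name for name in candidate_files(index, query)
                  if query in files[name])

# Hypothetical repository contents.
FILES = {
    "main.go": 'fmt.Println("hello world")',
    "util.go": "func helper() { return }",
    "trap.go": "hel ell llo -- separate pieces",
}

index = build_index(FILES)
candidates = candidate_files(index, "hello")
results = search(index, FILES, "hello")
```

The index narrows the search to main.go and trap.go without reading util.go at all; the confirmation step then discards trap.go, which contains the three trigrams but never the string "hello".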

Repository revisions that are not indexed are handled by a separate "searcher" process (which is horizontally scalable via Kubernetes). It fetches a zip archive of the repository from Sourcegraph's server (i.e. gitserver) and iterates through the files in it, matching using Go's regexp package for regular-expression searches or the Comby library for structural searches. By default, only a repository's default branch is indexed, but Sourcegraph 3.18 added the ability to index non-default branches.
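The unindexed path amounts to a brute-force scan, which might look roughly like the following Python sketch. It makes several assumptions: the archive is built in memory rather than fetched from gitserver, Python's re module stands in for Go's regexp package (the two dialects differ), and all names and file contents are hypothetical.

```python
import io
import re
import zipfile

def search_archive(archive_bytes, pattern):
    """Scan every file in a zip archive; return (path, line_no, line) hits."""
    compiled = re.compile(pattern)
    hits = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Undecodable bytes are replaced rather than aborting the scan.
            text = archive.read(info).decode("utf-8", errors="replace")
            for line_no, line in enumerate(text.splitlines(), start=1):
                if compiled.search(line):
                    hits.append((info.filename, line_no, line))
    return hits

def make_demo_archive():
    """Build a small in-memory archive standing in for a fetched repository."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("cmd/main.go",
                         "package main\n\nfunc main() {\n\trun()\n}\n")
        archive.writestr("README.md", "Call run() to start.\n")
    return buffer.getvalue()

hits = search_archive(make_demo_archive(), r"\brun\(\)")
```

Every file is read on each query, which is why this path is reserved for revisions that have no index.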

Code intelligence

The second main feature of Sourcegraph is what the company calls "code intelligence": the ability to navigate to the definition of the variable or function under the cursor, or to find all references to it. By default, these features use "search-based heuristics, rather than parsing the code into an AST [abstract syntax tree]", but the heuristics seem to be quite accurate in the tests I ran. The tool found definitions in C, Python, and Go without a problem, and even found dynamically-assigned definitions in Python (such as being able to go to the definition of the assigned and re-assigned scandir_python name in my scandir project).

More recently, Sourcegraph has implemented a more precise code-intelligence feature (which uses language-specific parse trees rather than search heuristics) using Microsoft's Language Server Index Format (LSIF), a JSON-based file format that is used to store data extracted by indexers for language tooling. Sourcegraph has written or maintains LSIF indexers for several languages, including Go, C/C++, and Python (all MIT-licensed). Currently, LSIF support in Sourcegraph is opt-in, and according to the documentation: "It provides fast and precise code intelligence but needs to be periodically generated and uploaded to your Sourcegraph instance." Sourcegraph's recommendation is to generate and upload LSIF data on every commit, but developers can also set up a periodic job to index less frequently.

Code intelligence queries are broken down into three types: hover queries (which retrieve the documentation associated with a symbol to display as "hover text"), go-to-definition queries, and find-references queries. Precise LSIF information is used if it is available; otherwise, Sourcegraph falls back to returning "fuzzy" results based on a combination of Ctags and searching.
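The precise-then-fuzzy fallback can be sketched as follows. This is an illustration only and not Sourcegraph's code: the dictionary PRECISE_INDEX stands in for uploaded LSIF data, a regular-expression scan stands in for the Ctags-plus-search heuristics, and every name and location here is hypothetical.

```python
import re

# Hypothetical precise data, as if produced by an indexer.
PRECISE_INDEX = {"walk": [("scandir.py", 10)]}

# Hypothetical source files available for search-based lookup.
SOURCES = {
    "scandir.py": "import os\n\ndef scandir_python(path):\n    return []\n",
}

def precise_lookup(symbol):
    """Return indexed definition locations, or None if the symbol is unknown."""
    return PRECISE_INDEX.get(symbol)

def fuzzy_lookup(symbol):
    """Guess definitions by searching for 'def NAME' or 'class NAME' lines."""
    pattern = re.compile(r"^\s*(?:def|class)\s+" + re.escape(symbol) + r"\b")
    return [(path, n)
            for path, text in SOURCES.items()
            for n, line in enumerate(text.splitlines(), start=1)
            if pattern.search(line)]

def resolve_definition(symbol):
    """Prefer precise results; fall back to search-based ones."""
    precise = precise_lookup(symbol)
    if precise:
        return "precise", precise
    return "fuzzy", fuzzy_lookup(symbol)

indexed = resolve_definition("walk")
unindexed = resolve_definition("scandir_python")
```

A symbol covered by the index is answered from it directly, while scandir_python, which has no precise data in this example, is located by the text search instead.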

Open source?

Sourcegraph's licensing is open core, but the delivery is somewhat unusual: all of the source, including the proprietary code, is in a single public repository, but the code under the enterprise/ and web/src/enterprise/ directories is subject to the Sourcegraph Enterprise license, and the rest of the code is under the Apache license. The pre-built Docker images provided by Sourcegraph include the enterprise code "to provide a smooth upgrade path to Sourcegraph Enterprise", but the repository provides a build script that builds a fully open-source image. The enterprise code includes a check to disallow more than ten users, but that won't be included in an open-source build. Overall, building and installing the open-source version is not well documented and its setup script may be missing some steps; it definitely feels like a second-class citizen.

Sourcegraph (the company) runs a hosted version of the system that allows anyone to search "top" public repositories from various code hosts. It is unclear how "top" is defined, or exactly what repositories are indexed in this hosted version, but this version provides a good demonstration of the features available. The company's pricing page lists the features that are restricted to the enterprise version, including: the Campaigns multi-repository refactoring tool, support for multiple code hosts, custom branding, live training sessions, and more.

Setup

Installing the pre-built Sourcegraph images was quick using the docker-compose method, as shown in its installation documentation. It took a couple of minutes to get it up and running, and a few more minutes to configure it. I was running it on my local machine, so I used an ngrok tunnel to (temporarily) provide an internet-facing domain with https support (it didn't need this to run, but certain features work better if it is provided). The even quicker single-command Docker installation method also worked fine, but I decided to try out the docker-compose option: it seems slightly more realistic, as it's recommended for small and medium production deployments and not just local testing. For larger, highly-available deployments, Sourcegraph recommends deploying on a Kubernetes cluster.

Very little configuration was required to set things up: creating an admin user, and pointing the system at a code host (in my case, I needed to create a GitHub access token to allow Sourcegraph to access my public and private repositories on GitHub). As soon as the access token was added, Sourcegraph started cloning and indexing the repositories. A couple of minutes later, they were ready to search. The system is optimized for self-hosting; presumably the company wants to make it easy for developers to set it up for a small number of test users (and then ask them to start paying when they go above ten users).

One of the "features" that may give some people pause is what Sourcegraph calls "pings"; by default, the tool sends a POST request to https://sourcegraph.com/.api/updates approximately every 30 minutes "to help our product and customer teams". This "critical telemetry" includes "the email address of the initial site installer" and the "total count of existing user accounts", presumably so the company can try to contact an installer about paying for its enterprise offering when the ten-user threshold is reached. It can only be turned off by modifying the source code (the ping code is in the open-source core, so someone could comment out this line to get rid of it). By default, the system also sends aggregated usage information for some product features, but this can be turned off by setting the DisableNonCriticalTelemetry configuration variable. To its credit, Sourcegraph is up-front about its "ping philosophy", and clearly states that it never sends source code, filenames, or specific search queries.

Browser and editor integrations

In addition to the search server and web UI, Sourcegraph provides browser extensions for Chrome and Firefox that enable its features to be used when browsing on hosts like GitHub and GitLab. For example, when reviewing a pull request on GitHub, a developer with the Sourcegraph extension installed can quickly go to a definition, find all references, or see the implementations of a given interface. As of June 2019, GitHub has a similar feature, which uses its semantic library, though the Sourcegraph browser extension seems to be more capable (for example, it finds struct fields, and not just functions and methods). The Sourcegraph browser extension tries to keep a developer on github.com if it can, but for certain links and definitions it goes to the Sourcegraph instance's URL.

Sourcegraph also provides editor integrations for four popular editors (Visual Studio Code, Atom, IntelliJ, and Sublime Text). These plugins allow the developer to open the current file in Sourcegraph, or search the selected text using Sourcegraph (the plugins open the results in a browser). The browser extensions and editor plugins fit with one of Sourcegraph's principles: "We eventually want to be a platform that ties together all of the tools developers use".

In conclusion

The development of Sourcegraph is fairly open as well, with tracking issues for the upcoming 3.19 and 3.20 releases, as well as a work-in-progress roadmap. Along with many improvements planned for the core (search and code intelligence), such as "OpenGrok parity", it looks like the company is working on its cloud offering, and that the Campaigns feature will see significant improvements.

Sourcegraph looks to be a well-designed system that is useful, especially for large code bases and big development teams. In fact, the documentation implies that the tool might not be the right fit for small teams: "Sourcegraph is more useful to developers working with larger code bases or teams (15+ developers)." Some may also be put off by the poorly-supported open-source build and the phone-home "pings"; however, it does look like some folks have persisted with the open-source version and have gotten it working.


Index entries for this article
GuestArticles: Hoyt, Ben



Searching code with Sourcegraph

Posted Aug 18, 2020 23:29 UTC (Tue) by riking (subscriber, #95706)

We're running a Sourcegraph instance for our (public, open source) monorepo at https://cs.tvl.fyi and it works pretty well!
I've noticed that not having LSIF does hurt reference lookup for extremely common phrases and variable names. This is obviously a solvable problem; we just haven't gotten it set up yet.

We use an nginx hack to have everyone be logged in as "Anonymous" (https://cs.tvl.fyi/depot/-/blob/ops/nixos/www/cs.tvl.fyi....).

Searching code with Sourcegraph

Posted Aug 18, 2020 23:31 UTC (Tue) by riking (subscriber, #95706)

https://github.com/sourcegraph/sourcegraph/pull/11575 was inspired by this, by the way. Looks like it's still in progress.

Searching code with Sourcegraph

Posted Aug 19, 2020 1:02 UTC (Wed) by jkingweb (subscriber, #113039)

Having the proprietary bits be visible is a nice feature. I suppose you could build your own and have some assurance that there's no code you're not aware of lurking in there?

Zoekt & trigram background

Posted Aug 19, 2020 9:12 UTC (Wed) by hanwen (subscriber, #4329)

For those interested in trigram search, I did a talk on Zoekt that folks might find interesting. See here: https://www.youtube.com/watch?v=_-KTAvgJYdI

Searching code with Sourcegraph

Posted Aug 22, 2020 13:42 UTC (Sat) by jezuch (subscriber, #52988)

Not sure if it qualifies as "relatively recent", but the earliest tool for structural search (and refactoring) I'm aware of is James Gosling's Jackpot, circa 2000 (it now lives on as part of the NetBeans IDE). Linux developers should also be familiar with Coccinelle, a so-called "semantic search" tool used to find and fix bugs automatically, and also for widespread refactorings. There's also my favourite feature of IntelliJ IDEA :) I'm kind of a connoisseur of static analysis and structural search/replace tools, so it's good to see another one out there!

Searching code with Sourcegraph

Posted Aug 23, 2020 11:40 UTC (Sun) by hazmat (subscriber, #668)

This type of "open source" is, imo, a disturbingly common trend (Elastic, TimescaleDB, etc.). It effectively taints OSS contributors who do something as simple as looking at the log or a diff in the repo, as commits are often co-mingled; it represents fairly community-hostile behavior, imo. Elastic, for example, has gotten into the habit of suing other developers building on the OSS (Search Guard).


Copyright © 2020, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds