Searching code with Sourcegraph
Sourcegraph is a tool for searching and navigating around large code bases. The tool has various search methods, including regular-expression search, and "structural search", which is a relatively new technique that is language-aware. The open-source core of the tool comes with code search, go-to-definition and other "code intelligence" features, which provide ways for developers to make sense of multi-repository code bases. Sourcegraph's code-searching tools can show documentation for functions and methods on mouse hover and allow developers to quickly jump to definitions or to find all references to a particular identifier.
The Sourcegraph server is mostly written in Go, with the core released under the Apache License 2.0; various "enterprise" extensions are available under a proprietary license. The company behind Sourcegraph releases a new version of the tool every month, with the latest release (3.18) improving C++ support and the 3.17 release featuring faster and more accurate code search as well as support for AND and OR search operators.
Code search
The primary feature of Sourcegraph is the ability to search code across one or more repositories. Results usually come back in a second or two, even when searching hundreds of repositories. The default query style is literal search, which will match a search string like "foo bar" exactly, including the quotes. Clicking the .* icon in the right-hand side of the search bar switches to regular-expression search, and either of those search modes supports case-sensitive matching (by clicking the Aa icon).
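The difference between the two modes can be sketched in a few lines of Python. This is only an illustration built on Python's re module, not Sourcegraph's implementation; the function name search_lines and its parameters are made up for the example, and case-insensitive matching is assumed to be the default because the Aa icon has to be clicked to get case-sensitive results.

```python
import re

def search_lines(lines, query, regexp=False, case_sensitive=False):
    """Return the lines matching query, literally or as a regular expression."""
    # In literal mode every character of the query, including quotes
    # and regex metacharacters, has to match itself.
    pattern = query if regexp else re.escape(query)
    flags = 0 if case_sensitive else re.IGNORECASE
    compiled = re.compile(pattern, flags)
    return [line for line in lines if compiled.search(line)]

lines = ['x := "foo bar"', "y := foo bar", "z := Foo.Bar()"]

# Literal search: the quotes are part of what must match.
literal = search_lines(lines, '"foo bar"')
# Regular-expression search: "." matches any character.
regex = search_lines(lines, r"foo.bar", regexp=True)
# The same regular expression with case-sensitive matching switched on.
strict = search_lines(lines, r"foo.bar", regexp=True, case_sensitive=True)
```

Here the literal query only finds the line that contains the quoted string, while the regular expression finds all three lines until case sensitivity is turned on.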
The [] icon switches to "structural search", a search syntax created by Rijnard van Tonder (who works at Sourcegraph) for his Comby project. Structural searches are language-aware, and handle nested expressions and multi-line statements better than regular expressions. Structural search queries are often used to find potential bugs or code simplifications, for example, a query for the following:
    fmt.Sprintf(":[str]")

That will find places where a developer can eliminate a fmt.Sprintf() call when it has a single argument that is just a string literal.
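To see why nested expressions are awkward for regular expressions, here is a rough Python sketch of what a structural "hole" has to do: capture everything up to the matching closing parenthesis rather than the first one. It is not Comby's algorithm; the helper find_call_args is hypothetical, and it only understands parentheses and double-quoted strings, whereas a real structural matcher handles far more of each language's syntax.

```python
import re

def find_call_args(source, callee):
    """Yield the argument text of each callee(...) call, honoring nesting."""
    start = source.find(callee + "(")
    while start != -1:
        i = start + len(callee) + 1   # first character after "("
        begin = i
        depth, in_string = 1, False
        while i < len(source) and depth > 0:
            ch = source[i]
            if in_string:
                if ch == "\\":
                    i += 1            # skip the escaped character
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            i += 1
        if depth == 0:
            # i now points just past the matching ")".
            yield source[begin:i - 1]
        start = source.find(callee + "(", i)

code = 'msg := fmt.Sprintf("%d items (%s)", len(items), name(kind))'

# Balanced matching captures the whole argument list.
balanced = list(find_call_args(code, "fmt.Sprintf"))
# A naive non-greedy regular expression stops at the first ")".
naive = re.findall(r"fmt\.Sprintf\((.*?)\)", code)
```

The balanced matcher returns the full argument list, while the naive regular expression is cut short by the parenthesis inside the format string.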
The documentation has an architecture diagram that shows the various processes a Sourcegraph installation runs. There is also a more detailed description of the "life of a search query". The front end starts by looking for a repo: filter in the query to decide which repositories need to be searched. The server stores its list of repositories in a PostgreSQL database, along with most other Sourcegraph metadata; Git repositories are cloned and stored in the filesystem normally.
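The first step, picking repositories from repo: filters, might look roughly like the following Python sketch. It is an illustration under assumptions, not the front end's code: the functions split_query and select_repos are invented, a plain list stands in for the PostgreSQL table, the repository names are placeholders, and the filter value is treated as a regular expression on the repository name with multiple filters all required to match.

```python
import re

def split_query(query):
    """Split a query into repo: filter patterns and the remaining search terms."""
    repo_patterns, terms = [], []
    for token in query.split():
        if token.startswith("repo:"):
            repo_patterns.append(token[len("repo:"):])
        else:
            terms.append(token)
    return repo_patterns, " ".join(terms)

def select_repos(all_repos, repo_patterns):
    """Keep the repositories whose names match every repo: pattern."""
    compiled = [re.compile(p) for p in repo_patterns]
    return [r for r in all_repos if all(c.search(r) for c in compiled)]

# Placeholder repository names; a real instance reads these from its database.
all_repos = [
    "github.com/example/scandir",
    "github.com/example/tools",
    "github.com/other/project",
]

patterns, terms = split_query(r"repo:^github\.com/example/ scandir_python")
selected = select_repos(all_repos, patterns)
```

With no repo: filter at all, the list of patterns is empty and every repository is selected, which matches the idea of searching everything by default.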
Next, the server determines which repositories are indexed (for a specific revision if specified in the search query) and which are not. Both indexing the repositories and indexed searches are handled by zoekt, which is a trigram-based code-search library written in Go. (Those curious about using trigrams for searching code may be interested in Go technical lead Russ Cox's article about it.)
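The basic idea behind a trigram index can be shown with a toy Python version. This is a minimal sketch of the general technique, not zoekt's data structures (a production index also has to deal with things like match positions, case folding, and on-disk storage); the function names and sample files are invented for the illustration.

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of all three-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(files):
    """Map each trigram to the set of file names that contain it."""
    index = defaultdict(set)
    for name, content in files.items():
        for gram in trigrams(content):
            index[gram].add(name)
    return index

def search(index, files, query):
    """Find files containing query: narrow by trigrams, then verify."""
    grams = trigrams(query)
    if not grams:
        # Queries shorter than three characters cannot use the index.
        candidates = set(files)
    else:
        # A file can only contain the query if it contains every trigram.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all trigrams does not guarantee a match, so check each candidate.
    return sorted(name for name in candidates if query in files[name])

files = {
    "a.go": 'fmt.Println("hello")',
    "b.go": 'log.Printf("hello %s", name)',
    "c.py": 'print("help")',
}
index = build_index(files)
hello_hits = search(index, files, "hello")
printf_hits = search(index, files, "Printf")
```

The index turns a scan of every file into a few set intersections followed by a check of the handful of candidates that survive.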
Repository revisions that are not indexed are handled by a separate "searcher" process (which is horizontally-scalable via Kubernetes). It fetches a zip archive of the repository from Sourcegraph's server (i.e. gitserver) and iterates through the files in it, matching using Go's regexp package for regular-expression searches or the Comby library for structural searches. By default, only a repository's default branch is indexed, but Sourcegraph 3.18 added the ability to index non-default branches.
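The unindexed path is conceptually simple, as this Python sketch suggests. It is an approximation rather than the searcher's code: make_archive builds an in-memory zip that stands in for the archive fetched from gitserver, search_archive is an invented name, and Python's re module is used where the real process uses Go's regexp package (the two regular-expression dialects are not identical).

```python
import io
import re
import zipfile

def search_archive(archive_bytes, pattern):
    """Return (file name, line number, line) for every matching line in a zip."""
    regex = re.compile(pattern)
    matches = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # directory entry, nothing to search
            text = archive.read(name).decode("utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    matches.append((name, lineno, line))
    return matches

def make_archive(files):
    """Build an in-memory zip, standing in for the archive fetched from gitserver."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()

archive = make_archive({
    "main.go": 'package main\n\nfunc main() {\n\tprintln("hi")\n}\n',
    "util.go": "package main\n\nfunc helper() {}\n",
})
results = search_archive(archive, r"^func \w+\(")
```

Because every file has to be read and scanned for each query, this approach costs more per search than an indexed lookup, which is presumably why the process is built to scale horizontally.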
Code intelligence
The second main feature of Sourcegraph is what the company calls "code intelligence": the ability to navigate to the definition of the variable or function under the cursor, or to find all references to it. By default, these features use "search-based heuristics, rather than parsing the code into an AST [abstract syntax tree]", but the heuristics seem to be quite accurate in the tests I ran. The tool found definitions in C, Python, and Go without a problem, and even found dynamically-assigned definitions in Python (such as being able to go to the definition of the assigned and re-assigned scandir_python name in my scandir project).
More recently, Sourcegraph has implemented a more precise code-search feature (which uses language-specific parse trees rather than search heuristics) using Microsoft's Language Server Index Format (LSIF), a JSON-based file format that is used to store data extracted by indexers for language tooling. Sourcegraph has written or maintains LSIF indexers for several languages, including Go, C/C++, and Python (all MIT-licensed). Currently, LSIF support in Sourcegraph is opt-in, and according to the documentation: "It provides fast and precise code intelligence but needs to be periodically generated and uploaded to your Sourcegraph instance." Sourcegraph's recommendation is to generate and upload LSIF data on every commit, but developers can also set up a periodic job to index less frequently.
Code intelligence queries are broken down into three types: hover queries (which retrieve the documentation associated with a symbol to display as "hover text"), go-to-definition queries, and find-references queries. Precise LSIF information is used if it is available; otherwise, Sourcegraph falls back to returning "fuzzy" results based on a combination of Ctags and searching.
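The fallback behavior amounts to a simple preference order, sketched here in Python. Everything in the example is hypothetical: the function find_definition, the dictionary standing in for uploaded LSIF data, the fuzzy_search callable standing in for the Ctags-plus-search path, and the symbol names and locations.

```python
def find_definition(symbol, precise_index, fuzzy_search):
    """Prefer precise LSIF data; fall back to search-based results."""
    location = precise_index.get(symbol)
    if location is not None:
        return {"location": location, "precise": True}
    # No precise data was uploaded for this symbol, so guess by searching.
    return {"location": fuzzy_search(symbol), "precise": False}

# Hypothetical stand-ins for uploaded LSIF data and the search-based path.
precise_index = {"walk": "pkg/walk.go:42"}

def fuzzy_search(symbol):
    return "first search hit for " + symbol

precise_result = find_definition("walk", precise_index, fuzzy_search)
fuzzy_result = find_definition("scan", precise_index, fuzzy_search)
```

Carrying a "precise" flag alongside the result lets a user interface tell the reader how much to trust the answer.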
Open source?
Sourcegraph's licensing is open core, but the delivery is somewhat unusual: all of the source, including the proprietary code, is in a single public repository, but the code under the enterprise/ and web/src/enterprise/ directories is subject to the Sourcegraph Enterprise license, and the rest of the code is under the Apache license. The pre-built Docker images provided by Sourcegraph include the enterprise code "to provide a smooth upgrade path to Sourcegraph Enterprise", but the repository provides a build script that builds a fully open-source image. The enterprise code includes a check to disallow more than ten users, but that won't be included in an open-source build. Overall, building and installing the open-source version is not well documented and its setup script may be missing some steps; it definitely feels like a second-class citizen.
Sourcegraph (the company) runs a hosted version of the system that allows anyone to search "top" public repositories from various code hosts. It is unclear how "top" is defined, or exactly what repositories are indexed in this hosted version, but this version provides a good demonstration of the features available. The company's pricing page lists the features that are restricted to the enterprise version, including: the Campaigns multi-repository refactoring tool, support for multiple code hosts, custom branding, live training sessions, and more.
Setup
Installing the pre-built Sourcegraph images was quick using the docker-compose method, as shown in its installation documentation. It took a couple of minutes to get it up and running, and a few more minutes to configure it. I was running it on my local machine, so I used an ngrok tunnel to (temporarily) provide an internet-facing domain with https support (it didn't need this to run, but certain features work better if it is provided). The even quicker single-command Docker installation method also worked fine, but I decided to try out the docker-compose option: it seems slightly more realistic, as it's recommended for small and medium production deployments and not just local testing. For larger, highly-available deployments, Sourcegraph recommends deploying on a Kubernetes cluster.
Very little configuration was required to set things up: creating an admin user, and pointing the system at a code host (in my case, I needed to create a GitHub access token to allow Sourcegraph to access my public and private repositories on GitHub). As soon as the access token was added, Sourcegraph started cloning and indexing the repositories. A couple of minutes later, they were ready to search. The system is optimized for self-hosting; presumably the company wants to make it easy for developers to set it up for a small number of test users (and then ask them to start paying when they go above ten users).
One of the "features" that may give some people pause is what Sourcegraph calls "pings"; by default, the tool sends a POST request to https://sourcegraph.com/.api/updates.com approximately every 30 minutes "to help our product and customer teams". This "critical telemetry" includes "the email address of the initial site installer" and the "total count of existing user accounts", presumably so the company can try to contact an installer about paying for its enterprise offering when the ten-user threshold is reached. It can only be turned off by modifying the source code (the ping code is in the open-source core, so someone could comment out this line to get rid of it). By default, the system also sends aggregated usage information for some product features, but this can be turned off by setting the DisableNonCriticalTelemetry configuration variable. To its credit, Sourcegraph is up-front about its "ping philosophy", and clearly states that it never sends source code, filenames, or specific search queries.
Browser and editor integrations
In addition to the search server and web UI, Sourcegraph provides browser extensions for Chrome and Firefox that enable its features to be used when browsing on hosts like GitHub and GitLab. For example, when reviewing a pull request on GitHub, a developer with the Sourcegraph extension installed can quickly go to a definition, find all references, or see the implementations of a given interface. As of June 2019, GitHub has a similar feature, which uses its semantic library, though the Sourcegraph browser extension seems to be more capable (for example, it finds struct fields, and not just functions and methods). The Sourcegraph browser extension tries to keep a developer on github.com if it can, but for certain links and definitions it goes to the Sourcegraph instance's URL.
Sourcegraph also provides editor integrations for four popular editors (Visual Studio Code, Atom, IntelliJ, and Sublime Text). These plugins allow the developer to open the current file in Sourcegraph, or search the selected text using Sourcegraph (the plugins open the results in a browser). The browser extensions and editor plugins fit with one of Sourcegraph's principles: "We eventually want to be a platform that ties together all of the tools developers use".
In conclusion
The development of Sourcegraph is fairly open as well, with tracking issues for the upcoming 3.19 and 3.20 releases, as well as a work-in-progress roadmap. Along with many improvements planned for the core (search and code intelligence), such as "OpenGrok parity", it looks like the company is working on its cloud offering, and that the Campaigns feature will see significant improvements.
Sourcegraph looks to be a well-designed system that is useful, especially for large code bases and big development teams. In fact, the documentation implies that the tool might not be the right fit for small teams: "Sourcegraph is more useful to developers working with larger code bases or teams (15+ developers)." Some may also be put off by the poorly-supported open-source build and the phone-home "pings"; however, it does look like some folks have persisted with the open-source version and have gotten it working.
Index entries for this article | |
---|---|
GuestArticles | Hoyt, Ben |
Posted Aug 18, 2020 23:29 UTC (Tue)
by riking (subscriber, #95706)
[Link] (1 responses)
We use a nginx hack to have everyone be logged in as "Anonymous" (https://cs.tvl.fyi/depot/-/blob/ops/nixos/www/cs.tvl.fyi....).
Posted Aug 18, 2020 23:31 UTC (Tue)
by riking (subscriber, #95706)
[Link]
Posted Aug 19, 2020 1:02 UTC (Wed)
by jkingweb (subscriber, #113039)
[Link]
Posted Aug 19, 2020 9:12 UTC (Wed)
by hanwen (subscriber, #4329)
[Link]
Posted Aug 22, 2020 13:42 UTC (Sat)
by jezuch (subscriber, #52988)
[Link]
Posted Aug 23, 2020 11:40 UTC (Sun)
by hazmat (subscriber, #668)
[Link]
I've noticed that not having LSIF does hurt reference lookup for extremely common phrases and variable names. This is obviously a solvable problem, we just haven't gotten it set up yet.
Zoekt & trigram background