Development quotes of the week

[Posted June 6, 2018 by ris]

... the bug in question is indeed in m4 1.1 dated 1993-11-08.

As far as I know, this holds the record for the oldest bug reported in GNU software so far this year. (Maybe we should give Andy a prize; how about a plaque inscribed in EBCDIC? :-)

— Paul Eggert (Thanks to Henrique de Moraes Holschuh)

The Microsoft of today is a company that understands and embraces open-source development, both in the strict technical sense of publishing source code and in the broader sense of community-driven, collaborative development. The movement appears to be genuine, and frankly, that's not something that we should find altogether surprising: there's a hell of a lot of programmers working at the company, and many of them are users or contributors of open-source software themselves. They get it; it was only a matter of time before the company did, too.

— Peter Bright

The biggest screen in your house would seem a logical place to integrate cloud apps, but TVs are walled gardens. While it’s easy enough to hook up a laptop or PC and pop open a browser, there’s no simple, open framework for integrating all that wonderful data over the TV’s other inputs.

— Andrew "bunnie" Huang (Thanks to Paul Wise)

Consider all the data that's used to provide the value-added features on top of git. Issue tracking, wikis, notes in commits, lists of forks, pull requests, access controls, hooks, other configuration, etc.
Is that data stored in a git repository?

Github avoids doing that and there's a good reason why: By keeping this data in their own database, they lock you into the service. Consider if Github issues had been stored in a git repository next to the code. Anyone could quickly and easily clone the issue data, consume it, write alternative issue tracking interfaces, which then start accepting git pushes of issue updates and syncing all around. That would have quickly become the de facto distributed issue tracking data format.

Instead, Github stuck it in a database, with a rate-limited API, and while this probably had as much to do with expediency, and a certain centralized mindset, as intentional lock-in at first, it's now become such good lock-in that Microsoft felt Github was worth $7 billion.

— Joey Hess

"Is that data stored in a git repository?"

Posted Jun 7, 2018 14:50 UTC (Thu) by mattdm (subscriber, #18)

> Is that data stored in a git repository?

If you like this, please consider helping build Pagure, git forge software built around this idea. "Easy fix" tickets here: https://pagure.io/pagure/issues?status=Open&tags=easyfix

Re: Joey Hess

Posted Jun 7, 2018 17:42 UTC (Thu) by dw (subscriber, #12017) (7 responses)

I find the dig at Microsoft deeply distasteful; they are hardly pioneers of "Git with bag on the side" style project management tools. To my knowledge aside from FOSSIL, none actually exist that receive any kind of real-world usage (pet projects aside, one of which was mentioned here).

The idea of stashing everything in Git is a great one, and of course it occurred to more than just the comment author. However, if you explore this space even slightly, it's clear that stuffing everything in a DB *is much simpler* than giving control of a set of complex interlinked structures to end-users who can barely work a Git desktop interface at the best of times, all while somehow keeping a best-in-class project management UI in sync with whatever atrocities the users unleash behind the scenes.

If I use 'git rebase' to modify a far right-leaning issue comment's author to Linus Torvalds, what should show up in the UI? Should I have the right to purge comments from the UI at leisure? Should the purges be tracked historically somewhere? (In a DB? In Git.. again?) Can I do a pull request for a change to a project's tickets? But a pull request is a type of ticket!

Of course the simplest explanation is also the cheapest, most crowdpleasing and most paranoid one, and I'm very sad to see it repeated in LWN.

Re: Joey Hess

Posted Jun 7, 2018 17:53 UTC (Thu) by dw (subscriber, #12017) (4 responses)

Finally, a cursory search reveals the "rate limited API" permits 5,000 authenticated requests per hour. The largest repo I could find (Ansible) has only 42k tickets all time, or to put it another way, less than 12 hours for an export cron job to complete. This doesn't even slightly count as lock-in.
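The arithmetic behind that estimate can be checked with a short sketch. It assumes, as the comment does, roughly one API request per ticket and a quota of 5,000 requests per hour; the function name is made up for illustration.

```python
def hours_to_export(requests_needed: int, requests_per_hour: int = 5000) -> float:
    """Hours needed to issue `requests_needed` calls at a fixed hourly quota."""
    return requests_needed / requests_per_hour


# Assumption: one request per ticket, using the ~42k ticket count quoted above.
tickets = 42_000
hours = hours_to_export(tickets)
print(f"{hours:.1f} hours")  # 8.4 hours
```

Under those assumptions the export fits inside the 12-hour bound; any extra calls per ticket would scale the figure proportionally.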

rate limited API

Posted Jun 7, 2018 18:48 UTC (Thu) by joey (guest, #328) (2 responses)

Checking each issue for new comments takes an API call. The API is paginated == more API calls the more issues and comments there are. Finding comments attached to commits is additional API calls (and there was not a way to discover them other than trying every commit in turn last I checked). Repositories can have forks, which can have issues, which can have comments, for piles more API calls. (I know about this in some detail because I've written software to deal with all of it. It's constantly rate limited.)
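To make the call-count argument concrete, here is a rough sketch of how requests add up once every issue needs its own comment lookup and results are paginated. The page size of 100, the per-issue comment count, and the function name are illustrative assumptions, not figures taken from the GitHub documentation; forks and commit comments would add further calls on top.

```python
import math


def estimated_calls(issue_count: int, comments_per_issue: int, page_size: int = 100) -> int:
    """Rough request count for a full export of issues and their comments."""
    # Paginated listing of the issues themselves.
    listing_calls = math.ceil(issue_count / page_size)
    # At least one call per issue to check for comments, more if they span pages.
    pages_per_issue = max(1, math.ceil(comments_per_issue / page_size))
    comment_calls = issue_count * pages_per_issue
    return listing_calls + comment_calls


# Hypothetical project: 42,000 issues averaging 10 comments each.
calls = estimated_calls(issue_count=42_000, comments_per_issue=10)
print(calls, "calls,", round(calls / 5000, 1), "hours at 5,000 calls per hour")
```

The per-issue term dominates: the total grows with the number of issues even when most of them have no new comments.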

And then, most developers are not involved with a single software project. It's not uncommon to have dozens of dependencies you want to keep an eye on or are tangentially involved with in your work on a single project. Out of all your other projects.

12 hours to check something out is not fertile ground for a distributed ecosystem to develop: As evidence see the lack of such an ecosystem.

rate limited API

Posted Jun 7, 2018 19:24 UTC (Thu) by dw (subscriber, #12017)

> Finding comments attached to commits is additional API calls (and there was not a way to discover them other than trying every commit in turn last I checked

Seems that wasn't any time recently:

- https://developer.github.com/v3/issues/comments/#list-com...
- https://developer.github.com/v3/issues/events/#list-event...

> 12 hours to check something out is not fertile ground for a distributed ecosystem to develop

To be clear, it was 12 hours to sync 40k tickets and then, per the APIs above, O(changes) thereafter.
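A minimal sketch of what O(changes) means here: after the first full export, a sync only pays for tickets modified since the last run. The ticket records, the `updated_at` field, and the page size are hypothetical stand-ins for illustration, not the actual API.

```python
import math
from datetime import datetime


def incremental_calls(tickets: list[dict], last_sync: datetime, page_size: int = 100) -> int:
    """Requests needed to fetch only tickets changed since `last_sync`."""
    changed = [t for t in tickets if t["updated_at"] > last_sync]
    # Cost scales with the number of changes, not the total number of tickets.
    return math.ceil(len(changed) / page_size)


# Hypothetical data: 40,000 tickets, of which only 250 changed after the last sync.
last_sync = datetime(2018, 6, 7)
tickets = [{"id": i, "updated_at": datetime(2018, 6, 1)} for i in range(40_000)]
for t in tickets[:250]:
    t["updated_at"] = datetime(2018, 6, 8)

print(incremental_calls(tickets, last_sync))  # 3
```

This only models a paginated list of changed tickets; whether comments on each changed ticket need separate calls is exactly the point under dispute in this thread.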

rate limited API

Posted Jun 8, 2018 9:31 UTC (Fri) by jwilk (subscriber, #63328)

You shouldn't back up forks. It wasn't submitted upstream == it doesn't exist.

Also, anybody can make a fork and fill it with garbage. If you're backing up all forks blindly (like github-backup does by default), you're susceptible to DoS.

Re: Joey Hess

Posted Jun 7, 2018 19:55 UTC (Thu) by excors (subscriber, #95769)

Judging by discussions around https://lwn.net/Articles/754779/ on GitHub vs GitLab, projects often prefer GitHub primarily because it's more popular, meaning more potential contributors are familiar with it and already have accounts there, lowering the barriers to them becoming actual contributors. That's what makes GitHub's millions of users so valuable to Microsoft - they will continue to use GitHub, even if it were technically trivial to migrate to another service, and will attract more users, simply because of the network effects that apply to any online service with interaction between users.

https://docs.gitlab.com/ee/user/project/import/github.html makes it look straightforward to copy data from GitHub anyway, so their choice to not implement issue tracking with an awkward and inefficient form of database (i.e. something layered on Git) doesn't seem to be a significant obstacle to getting your data out.

Re: Joey Hess

Posted Jun 7, 2018 23:54 UTC (Thu) by rahulsundaram (subscriber, #21946)

>To my knowledge aside from FOSSIL, none actually exist that receive any kind of real-world usage (pet projects aside, one of which was mentioned here).

Pagure works this way. It is sponsored by Red Hat, used extensively by Fedora, and was strongly considered by Debian.

Re: Joey Hess

Posted Jun 8, 2018 21:35 UTC (Fri) by jhhaller (guest, #56103)

See NoteDb, which is how Gerrit has stored almost all data in git since release 2.15. Only the reviewed flag is stored in a database, for performance reasons. While Gerrit doesn't have an issue system, it does have user accounts and code review records, all of which are stored in Git. While some information has external indices for performance, the primary data is still in git. Gerrit is used in a number of high-profile projects, including OpenStack, Eclipse, and Android. Not all Gerrit instances have completely migrated from the older database, but support for the external database will be removed in the 3.0 release.

Development quotes of the week

Posted Jun 7, 2018 17:47 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

Why should data like issues and PR metadata be stored in git? Flat files are awful for these sorts of things.

This data just needs to be accessible for export.