Leading items
Welcome to the LWN.net Weekly Edition for June 25, 2020
This edition contains the following feature content:
- Updating the Git protocol for SHA-256: the next step in moving Git away from SHA-1.
- Open-source contact tracing, part 1: can contact tracing be done in a privacy-respecting manner with free software?
- Rethinking the futex API: the much-maligned futex API may get a rework.
- Simple IoT Devices using ESPHome: a useful tool for home-automation projects.
- PHP releases and support: how the PHP project approaches making releases and supporting them thereafter.
- More alternatives to Google Analytics: some more full-featured web-site analytics packages.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Updating the Git protocol for SHA-256
The Git source-code management system has for years been moving toward abandoning the Secure Hash Algorithm 1 (SHA-1) in favor of the more secure SHA-256 algorithm. Recently, the project moved a step closer to that goal with contributors implementing new Git protocol capabilities to enable the transition.
Why move from SHA-1
Fundamentally, Git repositories are built on hash values — presently using the SHA-1 algorithm. A simplified explanation of the importance of hash values to Git follows; readers may also be interested in our previous coverage where the details were covered.
SHA-1 hash values are strings that uniquely represent the contents of an object (for example a source file), and no two files should ever produce the same string. In Git, every object has a hash value representation of its contents. The specific directory structure of these objects is stored in an object called a tree object. This tree object is an organized hierarchy of hashes, each one pointing to a specific version of a specific object within the repository. Tree objects themselves, as mentioned, are also hashed when stored into the repository. When a commit to the repository occurs, the basic steps that occur are:
- Files are assigned new hash values (if they changed).
- A tree object is created, then hashed, containing all the hashes for all the files in their current state.
- A commit object is created and hashed, referencing the tree object's hash.
In short, Git uses SHA-1 hashes everywhere to ensure the integrity of the repository's contents by effectively creating a chain of hash values of the objects representing that repository over time, similar to blockchain technology.
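To make that concrete, here is a minimal sketch (in Python, not Git's actual code) of how an object ID is derived from a blob's contents; Git hashes a short header followed by the raw data, and the SHA-256 transition keeps that object format while swapping out the hash function:
import hashlib

def git_object_id(data: bytes, algo: str = "sha1") -> str:
    # Git hashes a header ("<type> <size>\0") followed by the raw contents
    header = b"blob %d\0" % len(data)
    return hashlib.new(algo, header + data).hexdigest()

content = b"hello, world\n"
print(git_object_id(content))             # 40-character SHA-1 object ID
print(git_object_id(content, "sha256"))   # 64-character SHA-256 object ID
Changing a single byte of the content changes the resulting ID completely, which is what lets tree and commit objects vouch for everything beneath them.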
The problem with SHA-1, or any hashing algorithm, is that its usefulness erodes once collisions become practical to produce. A collision, in this case, means two pieces of data that produce the same hash value. If an attacker is able to replace the contents of an object in such a way that it still produces the same hash value, the idea of trusting the hash to uniquely define the contents of a Git object breaks down. Worse, if one were able to produce such collisions deliberately, say to inject malicious code, the security implications would be devastating, since a file in the chain could be replaced unnoticed. Since practical compromises of SHA-1 have already happened, it is important to move away from it; that transition is one step closer with these recent developments.
State of the SHA-256 transition
The primary force behind the move from SHA-1 to SHA-256 is contributor brian m. carlson, who has been working over the years to make the transition happen. It has not been an easy task; the original Git implementation hard-coded SHA-1 as the only supported algorithm, and countless repositories need to be transitioned from SHA-1 to SHA-256. Moreover, in the time this transition is taking place, Git needs to maintain interoperability between the two hash algorithms within the context of a single repository, since users may still be using older Git clients.
The problems surrounding that transition are complicated. Different versions of Git clients and servers may or may not have SHA-256 support, and all repositories need to be able to work under both algorithms for some time to come. This means Git will need to keep track of objects in two different ways and work correctly regardless of the hashing algorithm. For example, users often abbreviate hash values when referencing commits (a short prefix like 412e40d041 instead of the full 40-character SHA-1 value), so even the fact that SHA-256 and SHA-1 hash values are different lengths is only marginally helpful.
In the latest round of patches, carlson proposes changes to the communication-protocol logic for dealing with the transition. This work doesn't sound like it was part of the original transition plan, but it became necessary in order to move forward, as carlson notes:
It was originally planned that we would not upgrade the protocol and would use SHA-1 for all protocol functionality until some point in the future. However, doing that requires a huge amount of additional work (probably incorporating several hundred more patches which are not yet written) and it's not possible to get the test suite to even come close to passing without a way to fetch and push repositories. I therefore decided that implementing an object-format extension was the best way forward.
The patch set enhances the pack protocol that is used by Git clients to include keeping track of the hashing algorithm. This is implemented via the new object-format capability. In the patch to the protocol documentation, carlson describes the object-format capability as a way for Git to indicate support for particular hashing algorithms:
This capability, which takes a hash algorithm as an argument, indicates that the server supports the given hash algorithms [...] When provided by the client, this indicates that it intends to use the given hash algorithm to communicate.
If the client supports hashes using SHA-256, this change to the protocol enables that to be specified directly. By omitting the capability, Git will assume the hash values are presented as SHA-1.
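A client-side sketch of that negotiation might look like the following; the exact capability syntax shown (an object-format=<algorithm> token) is an illustrative assumption rather than something taken from the patches:
def choose_object_format(server_capabilities) -> str:
    # Pick the advertised algorithm; with no object-format capability,
    # SHA-1 is assumed, as described above
    for cap in server_capabilities:
        if cap.startswith("object-format="):
            return cap.split("=", 1)[1]
    return "sha1"

print(choose_object_format(["object-format=sha256", "shallow"]))   # sha256
print(choose_object_format(["shallow"]))                           # sha1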
This provides a clear path forward for the most commonly used Git protocol (git://). It does not, however, address less desirable methods such as communicating over dumb HTTP (http://), since that method does not provide capabilities. To handle those situations, the implementation attempts to guess the hash algorithm in use by looking at the hash length. Carlson notes that this works, but could become a problem if SHA-256 were someday replaced with a different algorithm that also produces 256-bit output. To this, however, carlson says that he believes any hashing algorithm that might supersede SHA-256 will produce longer hashes:
The other two cases are the dumb HTTP protocol and bundles, both of which have no object-format extension (because they provide no capabilities) and are therefore distinguished solely by their hash length. We will have problems if in the future we need to use another 256-bit algorithm, but I plan to be improvident and hope that we'll move to longer algorithms in the future to cover ourselves for post-quantum security.
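As a rough illustration of that length-based fallback (again, not Git's code), the two formats can be told apart from a hexadecimal object ID alone:
def guess_object_format(hex_oid: str) -> str:
    # Dumb HTTP and bundles advertise no capabilities, so the only hint
    # is the length of the hexadecimal object ID itself
    formats = {40: "sha1", 64: "sha256"}
    try:
        return formats[len(hex_oid)]
    except KeyError:
        raise ValueError("unrecognized object-ID length: %d" % len(hex_oid))

print(guess_object_format("2fd4e1c67a2d28fced849ee1bb76e7391b93eb12"))  # sha1
Any future 256-bit algorithm would collide with the second entry, which is exactly the limitation carlson describes.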
Carlson acknowledges that his solution to the technical challenges of moving to SHA-256 isn't ideal. When cloning a repository, for example, the hash algorithm used by the remote repository isn't known up front. Carlson's work gets around this with a two-step process:
Clone support is necessarily a little tricky because we are initializing a repository and then fetching refs, at which point we learn what hash algorithm the remote side supports. We work around this by calling the code that updates the hash algorithm and repository version a second time to rewrite that data once we know what version we're using. This is the most robust way I could approach this problem, but it is still a little ugly.
What comes next
With this step complete, the end is in sight for a fully working implementation of SHA-256-based repositories. That will be a major milestone in the evolution of Git, and arguably will place it on solid footing for the future. Carlson has laid out what he expects the remaining patches to consist of:
Additional future series include one last series of test fixes (28 patches) plus six final patches in the series that enables SHA-256 support.
In closing, it is worth noting that one of the reasons this transition has been so hard is that the original Git implementation was not designed to allow swapping out the hashing algorithm. Much of the work put into the SHA-256 implementation has been walking back that initial design flaw. Once these changes are complete, Git will not only have an alternative to SHA-1, it will also be fundamentally indifferent to the hashing algorithm used. That should make it more adaptable in the future, should the need to replace SHA-256 with something stronger ever arise.
Open-source contact tracing, part 1
One of the responses to the COVID-19 pandemic is identifying the contacts of infected people so that they can be informed about the risk and seek medical care if needed. This is laborious work when done manually, so a number of applications have been developed to help with contact tracing, though they have caused debate about their effectiveness and privacy impact. Many of the applications were released under open-source licenses. Here, we look at the principles behind these applications and the software frameworks used to build them; part two will look at some applications in more detail, along with the controversies (especially related to privacy) around these tools.
COVID-19 tracing principles
The main goal of COVID-19 tracing applications is to notify users if they have been recently in contact with an infected person, so that they can isolate themselves or seek out testing. The creation of the applications is usually supported by governments, with the development performed by health authorities and research institutions. The Wikipedia page for COVID-19 apps lists (as of early June 2020) at least 38 countries with such applications in use or under development, and at least eight framework initiatives.
The applications trace the people that the user has had contact with for a significant period (for example, 15 minutes) in close physical proximity (a distance of around one meter). A complete tracing system usually consists of an application for mobile phones and server software.
For measuring distance and detecting the presence of other users, GPS and Bluetooth are the technical solutions used in practice. GPS appears in only a small number of projects because it lacks the necessary precision, especially inside buildings; it also does not work in enclosed spaces like underground parking garages and subways.
Most countries have chosen to measure distance using Bluetooth, generally the Bluetooth Low Energy (BLE) variant, which uses less energy than the classic version. This is important because the measurement is done by mobile phones, so Bluetooth will need to be active most of the time.
The Bluetooth protocol was not designed for these kinds of tasks, though, so research has been done on ways to measure distance accurately. A report [PDF] from the Pan-European Privacy-Preserving Proximity Tracing project shows that it is possible to measure distance using BLE signal strength, specifically received signal strength indication (RSSI). In a contact-tracing system using Bluetooth, the distance measurement is made by the two phones communicating using a specific message format. Since the formats differ between applications, communication is only guaranteed to work if both phones are using the same application.
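The report does not mandate a particular formula, but a common way to turn an RSSI reading into a rough distance estimate is a log-distance path-loss model; the sketch below is purely illustrative, with made-up calibration values:
def estimate_distance(rssi_dbm: float, rssi_at_1m_dbm: float = -59.0,
                      path_loss_exponent: float = 2.0) -> float:
    # Log-distance model: rssi = rssi_at_1m - 10 * n * log10(distance),
    # where n depends on the environment (walls, bodies, phone in a pocket)
    return 10 ** ((rssi_at_1m_dbm - rssi_dbm) / (10 * path_loss_exponent))

print(round(estimate_distance(-65.0), 1))   # about 2.0 meters with these values
In practice, the calibration depends on the phone model and the environment, which is a large part of why accurate measurement is hard.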
Centralized versus decentralized
Storing and communicating contacts is the main functionality of COVID-19 tracing applications. Two main approaches exist, centralized and decentralized, though applications may mix ideas from both models.
To understand the difference between the two types of systems, we need to look at how user identification works. Each user obtains a random global ID number, either from the central authority (in the centralized approach) or generated by the application (in decentralized systems). Since this is the global identification for the user, it reveals their identity. To preserve privacy, this global ID is never exchanged with peers (i.e. other phones) when registering encounters, though it may be known by the server. Instead, the global ID is used as a seed to generate temporary IDs using a cryptographic hash function (like SHA-256) or an HMAC, taking as input the global ID and a changing value, such as an increasing counter or an identifier for a time slot. Temporary IDs change frequently, for example every 15 minutes, and they are broadcast for other users to record.
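The details differ between protocols, but a minimal sketch of that derivation, using an HMAC keyed with a hypothetical global ID and the current 15-minute time slot, might look like this:
import hashlib
import hmac
import time

SLOT_LENGTH = 15 * 60    # rotate the temporary ID every 15 minutes

def temporary_id(global_id: bytes, now: float) -> bytes:
    # The secret global ID never leaves the device; only this derived,
    # short-lived value is broadcast over Bluetooth
    slot = int(now) // SLOT_LENGTH
    return hmac.new(global_id, str(slot).encode(), hashlib.sha256).digest()[:16]

global_id = b"\x8f" * 32                     # hypothetical 256-bit global ID
print(temporary_id(global_id, time.time()).hex())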
Centralized systems use a single server (usually controlled by the health authorities), which generates and stores the global IDs of users. When a user is infected, their contact log is uploaded to the health authorities, and the people who have been in contact with them are notified. The technical solutions for that vary, from manual operation to full automation in the application; the process is handled by the health authorities, and the user's application just receives a result.
Decentralized systems rely on the user's phone to generate both the global and temporary IDs. In those systems, the global ID may also change periodically. When a user is infected, they should upload their temporary IDs, or the information needed to generate them, to a common server. Other users download the shared infection data and their applications search it for matching contacts. This paper [PDF] provides details of a few different decentralized protocols.
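Continuing the sketch above, exposure checking in a decentralized system amounts to regenerating the temporary IDs an infected user would have broadcast and intersecting them with the IDs recorded locally; again, this illustrates the idea rather than any particular protocol:
import hashlib
import hmac

def regenerate_ids(global_id: bytes, slots) -> set:
    # Reproduce the temporary IDs broadcast during the given time slots
    return {hmac.new(global_id, str(slot).encode(), hashlib.sha256).digest()[:16]
            for slot in slots}

def find_exposures(infected_ids, recorded_ids, slots) -> set:
    # infected_ids: identifiers published by the server for infected users;
    # recorded_ids: temporary IDs this phone overheard and stored locally
    matches = set()
    for gid in infected_ids:
        matches |= regenerate_ids(gid, slots) & set(recorded_ids)
    return matches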
The main difference between the two approaches is in who generates the IDs, whether the central server knows them and the identities associated with them, and who calculates the infection risk. Both solutions need a central server to exchange lists of IDs for infected people.
Frameworks
Development of a tracing system requires first a contact-tracing protocol and then an application. Applications are typically developed by government agencies, which either use one of the existing frameworks (protocols) or create a new one. A number of such frameworks have been developed; most of them have at least part of their code released as open source. Here we look at some of those that are, or have been, used in deployed applications.
Temporary Contact Numbers Protocol
The first framework released was the Temporary Contact Numbers Protocol (TCN), which was initially developed by Covid Watch, then maintained by the TCN Coalition. The source code for this decentralized framework, including the protocol and reference implementation, is available under the Apache software license.
Devices using TCN broadcast randomly generated, temporary contact IDs; at the same time, the devices record all of the received ones. The Covid Watch white paper [PDF] gives an overview of the protocol. The actual implementation uses numbers derived from periodically changed keys (the TCN project README provides the cryptographic details), to minimize the data set that needs to be sent when a person is infected. All of the keys that are generated by the user's device are sent to the central server only if the user gets infected.
The TCN framework allows for variations in the implementation; for example, whether or not a central health authority needs to verify an infection report. TCN is (or was) used in a number of tracing applications, including the US-based CoEpi and German ito.
Decentralized Privacy-Preserving Proximity Tracing
Decentralized Privacy-Preserving Proximity Tracing (DP-3T or DP3T) is a decentralized protocol, similar to TCN. It was developed by a number of European research institutions. Its white paper [PDF] describes the algorithm in detail and gives an overview of the security challenges.
The generation of seed values and temporary identifications is performed by the phone, which also computes the risk of infection. The phone only downloads the parameters needed to determine the infection risk (e.g. duration of contact, signal strength) from the health authorities. DP-3T includes a set of additional features to increase privacy. All phones running the application upload dummy contact data to minimize the risk of revealing infected users. It also has an option to allow the infected users to split and edit the report, for example to exclude certain days or time periods.
DP-3T source code is available under the Mozilla Public License 2.0 and the project includes a work-in-progress implementation using the Exposure Notification API.
Exposure Notification API
The Exposure Notification framework was released in April 2020 by Google and Apple. It seems to be inspired by TCN and DP-3T, and has many similarities with them. It uses the same type of periodically changing keys (the cryptographic specification [PDF] gives the details).
The functionality that, in TCN, DP-3T, and other frameworks, was part of the application is instead implemented in Android and iOS themselves, then provided to applications as the Exposure Notification API [PDF]. It includes proximity detection and logging of the encountered keys, but not the notification of an infection; that part needs to be implemented in the application itself. The Exposure Notification API can be used only by applications provided by public health authorities. The specifications are available, but the source code of the implementation is not. The Google terms [PDF] include some specific requirements for the applications, including: only one application per country, a public health authority must be responsible for all of the data stored, and no advertising is allowed in the application.
A reference application for Android and an example server implementation are both available as source code under the Apache license. Since the release of the framework, applications that were ported to it include the Italian Immuni (source code under AGPL 3.0) and the Polish ProteGo Safe (source code under GPL 3.0). Another example is Covid Watch, which was one of the original supporters of TCN, but its application replaced TCN with the Exposure Notification framework in May 2020.
The Exposure Notification API solves one problem that many independent applications have encountered (the BlueTrace paper [PDF] describes the problem on page 7): on iOS, Bluetooth location measurement only works if the application is in the foreground. Since the French application does not use the API, the French government has asked Apple to allow background Bluetooth for other applications.
Applications and beyond
In this article, we explained the purpose of contact-tracing applications, described the technology they use, and the reasons they work the way they do. In the second article, which is coming soon, we will look deeper into some specific applications, whether they use existing frameworks or their own protocols, covering how they work and their open-source availability. Finally, we will consider the controversies and open questions around the deployment of these applications.
Rethinking the futex API
The Linux futex() system call is a bit of a strange beast. It is widely used to provide low-level synchronization support in user space, but there is no wrapper for it in the GNU C Library. Its implementation was meant to be simple, but kernel developers have despaired at the complex beast that it has become, and few dare to venture into that code. Recently, though, a new effort has begun to rework futexes; it is limited to a new system-call interface for now, but the plans go far beyond that.
There is a wide range of synchronization options within the kernel, but there have always been fewer options in user space. For years, the only real option was System V semaphores, but it is fair to say that they have never been universally loved. Developers have long wanted a mutex-like option for user space that does not kill performance.
Back in 2002, Rusty Russell proposed a fast user-space mutex mechanism that quickly became known as a "futex"; this feature was present in the 2.6.0 kernel release at the end of 2003 and immediately used to control concurrency for POSIX threads. The initial implementation was just a few hundred lines of code. At its core, a futex is a 32-bit word of memory shared between cooperating processes; a value of one indicates that the futex is available, while anything else marks it as unavailable. A process wishing to acquire a futex will issue a locked decrement instruction, then verify that the resulting value was zero; if so, the acquisition was successful and execution can continue. Releasing the futex is a simple matter of incrementing its value again.
The nice thing about futexes as described so far is that the kernel is not involved in their operation at all; futexes can be acquired and released without the need to make system calls. That cannot be sustained if there is contention for the futex, though; at that point, a task will have to block to wait for the futex to become available. That is where the futex() system call comes in:
int futex(int *uaddr, int futex_op, int val,
          const struct timespec *timeout, /* or: uint32_t val2 */
          int *uaddr2, int val3);
The initial futex() implementation had only two arguments: uaddr (the address of the futex) and futex_op, which would be either +1 to increment the futex, or -1 to decrement it. The modern equivalents for futex_op are FUTEX_WAKE (to signal that the futex has been freed and wake task(s) waiting for it) or FUTEX_WAIT to block until the futex becomes available.
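Since glibc provides no wrapper, user space reaches futex() through syscall(); the fragment below is a rough sketch of the wait and wake halves using Python's ctypes, with constants (the x86-64 system-call number 202, FUTEX_WAIT = 0, FUTEX_WAKE = 1) assumed from the kernel headers:
import ctypes
import ctypes.util
import threading
import time

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
SYS_futex, FUTEX_WAIT, FUTEX_WAKE = 202, 0, 1   # assumed x86-64 values

futex_word = ctypes.c_uint32(0)      # the shared 32-bit futex word

def waiter():
    # FUTEX_WAIT sleeps only while the word still holds the expected value (0)
    libc.syscall(SYS_futex, ctypes.byref(futex_word), FUTEX_WAIT, 0,
                 None, None, 0)
    print("woken: futex word is now", futex_word.value)

t = threading.Thread(target=waiter)
t.start()
time.sleep(0.1)                      # let the waiter block in the kernel

futex_word.value = 1                 # the uncontended "release" is pure user space
libc.syscall(SYS_futex, ctypes.byref(futex_word), FUTEX_WAKE, 1,
             None, None, 0)          # wake one waiter
t.join()
A real mutex would combine this with the atomic operations described above, entering the kernel only when the fast path detects contention.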
Many other operations also exist at this point. Over time, the futex interface has gained complexity, including "robust futexes", adaptive spinning, priority inheritance, and much more. See this article for a (somewhat dated) overview, the above-linked man page, or this low-level description for more information.
The current effort to rework futexes appears to be driven by a couple of concerns. One that goes mostly unstated is the desire to create a system-call interface that makes a bit more sense than futex(), which is a complex, multiplexed API with wildly varying arguments and a number of special cases. When a system call's documentation must be written in such terms, one can conclude with a fair degree of certainty that the API design is not as great as it could be.
In past years, when C-library developers have discussed exposing futex(), they have proposed splitting it into a set of simpler wrapper functions; that work has never been merged. Now, though, the futex2() proposal from André Almeida does the same thing, adding three new system calls:
int futex_wait(void *uaddr, unsigned long val, unsigned long flags, ktime_t *timeout);
int futex_wait_time32(void *uaddr, unsigned long val, unsigned long flags,
                      old_time32_t *timeout);
int futex_wake(void *uaddr, unsigned long nr_wake, unsigned long flags);
It is a rare patch set that includes a question like: "Has anyone started worked on a implementation of this interface?". Almeida's patch set adds no new functionality; indeed, it is currently rather less functional than the existing futex() API, lacking support for features like priority inheritance. Basic futex functionality is implemented, though, by calling into the existing futex() code.
The purpose of this patch set is clearly not to show off new features at this point; instead, the hope is to nail down what a new futex API should look like, with the new features to be added after that is done. That said, there are some enhancements that the developers have in mind and would like to get into place.
One of those is the ability to wait on multiple futexes at once and be awakened when any one of them becomes available. Gabriel Krisman Bertazi posted an implementation of this functionality (for futex()) in July 2019; it is driven in particular by the needs of Wine, which is emulating a similar Windows feature. In a discussion sparked by another posting of this patch set in March, Thomas Gleixner gently suggested that perhaps the time had come to design a new futex interface where features like this could be added (and used) more easily. The current proposal is a direct result of that suggestion.
That said, the proposed API doesn't handle multiple futexes, but the cover letter from the current series describes a planned addition:
struct futex_wait {
    void *uaddr;
    unsigned long val;
    unsigned long flags;
};

int futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters,
                unsigned long flags, ktime_t *timeout);
Another upcoming feature is the ability to handle futexes in a number of common sizes, not just the 32-bit ones supported today.
Then, there is the issue of performance on NUMA systems. The kernel must maintain internal data structures describing the futexes that are currently being waited on; if those structures are kept on the wrong NUMA node, futex operations can sustain a lot of remote-node cache misses, which slows them down considerably. See this article for details. Futexes are often used by threaded processes that are all running on the same NUMA node; their performance would be improved if the kernel kept its data structures on the same node. Thus, there is a "NUMA hint" functionality planned for the new API that would suggest that the kernel keep its associated data structures on a specific node.
While the thinking about how to improve futex functionality in the kernel has clearly entered a new phase, users should not hold their collective breath waiting for new futex features to enter production kernels. The complexity of this subsystem makes developers reluctant to push through quick changes; they have learned the hard way that it's easy for things to go wrong with futexes. So the new API and the implementation of any new features are likely to go through an extended period of review and revision. The "F" in "futex" may stand for "fast", but the implementation of new futex features may not be.
Simple IoT devices using ESPHome
ESPHome is a project that brings together two recent subjects at LWN: the open-source smart hub Home Assistant and the Espressif ESP8266 microcontroller. With this project, smart-home devices can be created and integrated quickly — without needing to write a single line of code.
Introducing ESPHome
ESPHome is a build and deployment system that takes all of the manual coding work out of integrating custom Internet of Things (IoT) devices with Home Assistant. It advertises support for not only the ESP8266, but also its big brother, the ESP32, and even various ESP8266-based off-the-shelf consumer devices from Sonoff. ESPHome achieves code-free integration by implementing the auto-discovery protocols necessary for Home Assistant to pull the features of the device into the hub with just a few clicks. Wiring up an ESP8266 to the desired hardware, and defining that hardware properly in the configuration, is all that is needed to enable it in the hub.
For hardware wired to an ESP8266 to be used with ESPHome, it must first be supported by an ESPHome component. The ESPHome project's website lists the various hardware it knows how to work with, from sensors to displays. While the collection of IoT device components is not as comprehensive as one could imagine, ESPHome does offer many of the common ones used in smart homes. The project's most recent release, v1.14.0 in November 2019, included 24 new components.
Test driving ESPHome
To evaluate exactly how things work using ESPHome, I decided to break out my breadboard and see how easy it is to build a sensor that integrates with Home Assistant. In total, the entire process of creating a new sensor took less than 30 minutes — not bad for a first try. For this experiment, I used what I had sitting on my desk: a version of the Wemos D1 Mini ESP8266 development board and a BME280 combination temperature, humidity, and pressure sensor.
The BME280 sensor communicates using I2C, a serial protocol designed for low-speed devices like microcontrollers. The specific sensor module used requires wiring four pins between the sensor and the ESP8266: 3.3V power, ground, SCL (I2C clock), and SDA (I2C data). More details on the wiring process are available for interested readers (that guide uses a comparable NodeMcu development board instead of a D1 Mini).
ESPHome can be installed from the Python Package Index (PyPI) using pip. For x86_64 systems, there is a Docker image esphome/esphome available — other architectures must use pip. ESPHome is a single command-line tool, esphome, used to perform various tasks like defining devices, building projects, and flashing them to a device. One useful feature when getting started is a wizard that walks the user through creating a new device firmware. It is accessible using:
$ esphome tempsensor.yaml wizard
Here, tempsensor.yaml is the chosen filename for the configuration to be generated. During the walk-through that follows, the basic information gathered (device name, microcontroller used, and WiFi credentials) serves as the basis for the initial YAML template. Completing the wizard produces the tempsensor.yaml device-definition file, with all of the basic requirements ESPHome needs to build a firmware:
esphome:
  name: testbmp280
  platform: ESP8266
  board: d1_mini

wifi:
  ssid: "my-ssid"
  password: "my-password"

  # Enable fallback hotspot (captive portal) in case wifi connection fails
  ap:
    ssid: "Testbmp280 Fallback Hotspot"
    password: "fallback-wpa-password"

captive_portal:

# Enable logging
logger:

# Enable Home Assistant API
api:
  password: "password"

ota:
  password: "password"
To define the sensor, the base template requires modification. Since we are using the BME280, this is a matter of consulting the ESPHome documentation for that sensor. The BME280 uses I2C to communicate, so the I2C bus component must be added first:
i2c:
  sda: D1
  scl: D2
  scan: true
There are a few things worth pointing out in the I2C block that may not be immediately obvious. In it, we define the sda and scl pins wired for use as the I2C bus on the ESP8266. This was one of the less obvious steps in making ESPHome work in testing, as the physical general-purpose I/O (GPIO) pins exposed on the D1 Mini board don't match the numbering scheme used by the underlying ESP8266 chip. Thankfully, as when programming an ESP8266 directly using the Arduino Core for ESP8266, identifiers like D1 and D2, matching the labels on the board, work in ESPHome templates.
Next, we provide the sensor definition for the BME280, as we determined from the documentation:
sensor:
  - platform: bme280
    address: 0x76
    temperature:
      name: "BME280 Temp"
      oversampling: 16x
    pressure:
      name: "BME280 Pressure"
    humidity:
      name: "BME280 Humidity"
    update_interval: 60s
The value of the address key in the preceding block represents the I2C address of the sensor on the I2C bus. The documentation states that this key is an optional field, implying it will be automatically determined at run time. Experience showed during testing, however, that defining it explicitly is necessary. To learn what that address is, the ESP8266 chip needed to be flashed first with I2C scanning enabled. This allowed the firmware to report via the logs the address of the BME280 device on the I2C bus. With the address identified, all that is left is adding it to the sensor definition's address key.
Each IoT device is obviously going to use its own combination of components based on its purpose, so each YAML device-configuration file will be different. Once the configuration file is complete, all that is left is plugging in the D1 Mini via USB and flashing the firmware. Building the complete firmware and flashing it to the device is done using the esphome run command:
$ esphome tempsensor.yaml run
In testing, building and flashing to the ESP8266 worked well. The run command does all the heavy lifting to compile the firmware and flash the resulting ROM onto the chip without further effort. As soon as the flash process is complete, a colorized log of what is happening on the device is displayed in the shell as the ESP8266 boots. This log is invaluable for understanding whether things are working, and for debugging any problems that may occur. For a device connected via WiFi, logs can also be followed remotely.
Once the device boots the ESPHome firmware, Home Assistant immediately identifies it on the local network and prompts the user to complete a brief setup process in the Home Assistant UI. Once that process is complete, the sensor data is sent to Home Assistant from then on. In testing, this experience worked flawlessly.
Advanced features
Building IoT sensors that integrate with Home Assistant for a variety of platforms is not all the project offers. There are a number of more advanced features in ESPHome that are helpful for practical IoT devices.
To begin, ESPHome implements a "fallback" mode for situations where it fails to connect to the configured WiFi network. In fallback mode, the device activates its own hotspot, providing the user with a captive portal to adjust WiFi settings or flash a new firmware entirely; the hotspot uses WPA2 for security. ESPHome also makes flashing a routine update to an existing firmware straightforward: both the run and upload commands of the esphome tool will automatically offer to re-flash the device over-the-air (OTA) without any hassle. The OTA mechanism can also be password-protected to prevent attackers from uploading their own ROMs.
Another of the more useful features of ESPHome is the baked-in automation capability. This feature allows programming of basic logic and behaviors into a device without requiring a WiFi connection or network services. For example, consider the following configuration for an IoT device that has a physical switch to turn on/off whatever it controls in addition to accepting commands over WiFi:
switch:
  - platform: gpio
    pin: GPIO3
    name: "Living Room Dehumidifier"
    id: dehumidifier1

binary_sensor:
  - platform: gpio
    pin: GPIO4
    name: "Living Room Dehumidifier Toggle Button"
    on_press:
      then:
        - switch.toggle: dehumidifier1
In this example, we define two hardware components: a simple push-button switch wired to GPIO4, and a relay, wired to GPIO3, controlling power to a dehumidifier. The switch is represented by a binary_sensor, a component that supports a variety of triggers based on its current state. In this case, the on_press trigger fires when the user presses the button; the device then calls one or more actions in sequence, which in this example is the switch.toggle action. The result is a device controllable over WiFi, or by a physical switch that functions even if networking is unavailable. Different components have different triggers, and the documentation for each is available on the project's site.
Community
The ESPHome project has a healthy community supporting it, with 132 contributors and 67 releases to date, including the latest v1.14.0 release. The project operates under a dual-licensing model: the C++ code is released under GPLv3 and the Python code under an MIT license. Those interested in contributing (documentation or code) can review the contributor guidelines for how best to get involved. There doesn't appear to be a mailing list for the project, but there is a Discord channel available.
PHP releases and support
PHP is used extensively on the web. Understanding how new features, security fixes, and bug fixes make their way into a release is important, as is knowing what community support can be expected for previous releases. Since PHP-based sites are typically exposed to the Internet, keeping up-to-date is not something a security-minded administrator can afford to ignore.
PHP has not always had a formal release process and corresponding time frame for support; the official policy the project has now wasn't adopted until 2011. Before then, the decisions of when to make releases and how long to support them were both made less formally by key members of the community.
Let's start with PHP versioning, where the project is more or less dependable. The versioning of PHP releases aims to follow Semantic Versioning. Major releases such as 3.0 and 4.0 always come with backward-compatibility breaks. Minor versions, such as 4.1 and 4.2, fix bugs and add new features that are backward-compatible in relation to the major release. Patch releases, such as 4.1.1, tend to be strictly for important bug fixes and should never break backward compatibility.
Note that these guidelines are by no means a hard and fast rule, however, as contributor Stanislav Malyshev points out:
BC [backward compatibility] is inherently subjective and subject to judgement, since strictly speaking any user-visible change, including bug fixes, can influence outcome of some code, and thus break BC. And some unobviously visible changes could too (e.g. making PHP engine 3 times faster could break someone's security based on certain operation in PHP being slow).
Thus, we can state our promise to keep BC as much as we feel is practical, but we should not make a suicide pact of "never change anything, no matter the cost, no matter the circumstance". And the level of acceptability raises from rightmost to leftmost digits, more to the left, more BC break we can accept. But ultimately, I think, it is a case by case basis, with some cases being obvious and some less obvious.
While we cannot provide an accurate estimate of how often backward compatibility is broken in minor and patch releases, it is not an entirely uncommon occurrence. For example, at least as recently as PHP 7.0.11, backward-compatibility breaks were introduced in bug fixes that subtly changed the behavior of iconv_substr().
PHP version community support
Historically, once a version of PHP is released, how long it is supported by the community is another matter entirely. Consider the following table, constructed from PHP release dates taken from Wikipedia and end-of-life dates as reported by the project:
PHP Version   Release Date         End of Life          Support Time
4.0           May 22nd, 2000       June 23rd, 2001      13 months
4.1           December 10th, 2001  March 12th, 2002     3 months
4.2           April 22nd, 2002     September 6th, 2002  4 months
4.3           December 27th, 2002  March 31st, 2005     27 months
4.4           July 11th, 2005      August 7th, 2008     36 months
5.0           July 13th, 2004      September 5th, 2005  13 months
5.1           November 24th, 2005  August 24th, 2006    9 months
5.2           November 2nd, 2006   January 6th, 2011    50 months
5.3           June 30th, 2009      August 14th, 2014    61 months
5.4           March 1st, 2012      September 3rd, 2015  42 months
5.5           June 20th, 2013      July 21st, 2016      37 months
5.6           August 28th, 2014    December 31st, 2018  52 months
7.0           December 3rd, 2015   January 10th, 2019   37 months
7.1           December 1st, 2016   December 1st, 2019   36 months
7.2           November 30th, 2017  November 30th, 2020  36 months
7.3           December 6th, 2018   December 6th, 2021   36 months
7.4           November 28th, 2019  November 28th, 2022  36 months
The consistency of the more recent numbers makes sense: the formal release process was adopted in 2011, around the time support for PHP 5.2 ended. According to the release process, PHP releases are based on a three-year life cycle. Each release is actively supported for two years from its official release date; during this time, the community fixes bugs, adds features, and fixes security vulnerabilities. After that two-year period, security-related support continues for another year.
According to the policies of the project, support for a specific version ends after three years. Hopefully, in the interim, the majority of applications running on that version of PHP migrate to a more recent version. It is a straightforward release policy, and certainly an improvement over no policy at all. However, things are more complicated than they seem — especially for major, or particularly popular, releases.
Complexities of PHP version support
Making sense of the historic inconsistencies behind the version-support time frames requires a closer look at PHP usage metrics. Fundamentally, the PHP community hopes that its users, hosting providers, and distribution packagers will simply keep pace with PHP's release cycle; trying to encourage this is one of the reasons PHP tends to focus on backward compatibility. That hope, however, is not the reality, as current usage statistics show. According to the May 2020 PHP-version usage statistics from Packagist.com, 22% of deployments are using unmaintained versions of PHP. The problem is likely even more severe than that, since those statistics are based only on the PHP version reported during use of the Composer dependency manager, a relatively new addition to the PHP ecosystem that has yet to be adopted by major PHP-based projects like WordPress. W3Techs, which surveys web sites directly, reports a much bleaker picture: 47% of relevant web sites had PHP 5 installed as of the time of this publication. That is a concerning statistic, since the last version of PHP 5 lost all community support almost a year and a half ago. The Composer-derived figures, though, better measure real PHP usage than the W3Techs numbers, which only report whether a server has PHP installed.
Why these percentages are so large is a hard question to answer. Some users are likely completely unaware of the problem; others may simply be neglecting it by not updating their applications to work with a more modern version. Either way, the community pays attention to these numbers and tends to support older-yet-popular releases longer. This is particularly true in the transitional period between major releases.
While the PHP project does have an official support policy, the reality is that it is often subject to change, with major releases historically deviating from it. Looking back, PHP 4.x was maintained by the community for four years past the release of PHP 5.0, and for over eight years in total, a point made by Zeev Suraski when discussing the end of support for PHP 5.6 after the release of PHP 7:
IMHO, I think we need to look at the 5.6 lifecycle very differently from how we look at 5.5 and earlier. This is really the 5.x lifecycle as it's the last version that's relatively completely painless to upgrade to from 5.x (especially 5.3 and later).
PHP 4 was maintained for 4+ years after PHP 5.0 was released (5.0 release July 2004, PHP 4 support ended 8/8/08). Not saying that we need to do the same for 5, but one year upgrade cycle for everyone on 5.x doesn't sound reasonable.
The reason the last release before a new major release gets special attention is all about user adoption. PHP 5 had significant backward-compatibility breaks in the object model compared to PHP 4, and likewise PHP 7 had backward-compatibility breaks when compared to PHP 5.6.
The incompatibilities of major releases extend beyond language features and behaviors alone. Another aspect of the problem is that non-core (PECL) extensions also need updating to be compatible with changes to the internals of PHP itself. If a given extension has Windows support, it may even require upgrading Microsoft Visual C++ in order for extension maintainers to catch up. Some extensions (such as one of mine) never make the jump, and are forever relegated to the abyss of PHP history. Why this happens varies, but it often has to do with how (un)popular an extension is to the broader PHP development community, versus the amount of maintainer effort required.
Backward-compatibility breaks, both in the language and extensions, hold adoption back. Meanwhile, the internals community tries to push both developers and extension maintainers into the future — major releases always implement important new features that the community wants hosting providers to adopt as quickly as possible. The longer support for an old version continues, the more likely everyone will drag their feet and stall adoption, as pointed out by Malyshev around the PHP 7 release:
We could make 5.6 an LTS release with extended support, but the question is given the code delta, would all fixes' authors be willing to do essentially double work? Would extension authors be willing to maintain two branches long-term? And, if that proves to be hard - wouldn't we end up with a situation where they choose to only maintain PHP 5 version (since it's easier and that's where 90% of people are) and extensions go unsupported for PHP 7 for a long time, creating an adoption problem for 7?
All this complexity makes providing dependable version-support policies a challenge for PHP; support is largely dependent on something outside of the project's control — adoption. Meanwhile, the project has, from the beginning, taken a release-early, release-often approach, a philosophy encouraged by the memory of the momentum lost to the project's over-reach in PHP 6's attempted transition to Unicode. These factors drive contributors to focus on faster releases and limited consensus rather than formal process; the result is a desire for frequent releases, with support for those releases often a secondary consideration.
These motivations haven't entirely prevented a degree of formality with regard to releases and support, just nowhere near what developers have come to expect from other languages. The less formal approach has largely been successful, but it hasn't always served the project well. For example, during the PHP 4 to PHP 5 transition, it led to the end-of-life date for PHP 4 remaining undecided. That burden was something the community was determined to avoid in the jump from PHP 5.6 to PHP 7.0. In an attempt to balance the competing forces of forging ahead versus user adoption, the community eventually voted on a Request for Comments (RFC), led by Suraski, to extend support for PHP 5.6. The result was extending the official policy to add a year of full support, followed by an extra two years of security support.
What this means for the future
As things stand, the policy on release support still doesn't address exceptions to account for the adoption rates that are a significant factor in those decisions; it only defines a minimum time that a version will be supported. Over the years, though, the general support from the community has certainly become more dependable. For major releases specifically, it seems the community is content to hash out long-term support decisions when they become a problem to solve, rather than formalize a change in the policy to address them. The wisdom of that approach is debatable, and it will likely lead to more passionate discussion on the issue around the time PHP 8 is officially released (currently scheduled for November 2020).
Based on past experience during the PHP 5.6 to PHP 7.0 transition, it does seem likely that, with PHP 8 on the horizon, PHP 7 support will continue for a longer period of time than the official policy indicates. If the PHP 7 support timetable is extended, the final end-of-life date will at least be clearly defined, since the community needs to vote on an RFC to extend support of PHP 7 beyond the current policy. That means users trying to make plans for the future will have to wait until the official release of PHP 8.0 to find out exactly how long applications based on PHP 7 will be supported. PHP 7.4 is likely to be the last version of PHP 7, and officially it will only be supported for another two and a half years.
Like so many things regarding PHP, support and release cycles are not as predictable and consistent as many would prefer. However, also like many things in PHP, version support has consistently improved as the project has matured. At the moment, it seems unlikely that the opposing forces driving these decisions will ever truly be reconciled, leaving PHP developers to contend with a degree of uncertainty that other languages lack.
More alternatives to Google Analytics
Last week, we introduced the privacy concerns with using Google Analytics (GA) and presented two lightweight open-source options: GoatCounter and Plausible. Those tools are useful for site owners who need relatively basic metrics. In this second article, we present several heavier-weight GA replacements for those who need more detailed analytics. We also look at some tools that produce analytics data from web-server access logs, in particular GoAccess.
Matomo
One of the most popular heavyweight offerings is the open core Matomo. It was formerly called "Piwik" and was created in 2007; LWN looked at Piwik way back in 2010. It's a full-featured alternative to Google Analytics, so companies that need the power of GA can transition to it, but still get the privacy and transparency benefits that come from using an open-source and self-hosted option. As an example, web agency Isotropic recently switched to Matomo:
We chose to do this as we wanted to respect our users privacy, and felt that hosting statistics on our own server was better for both us and them. [...] We needed something that rivaled the functionality of Google Analytics, or was even better than it. The solution needed to offer real-time analytics, geo-location, advertising campaign tracking, heat maps, and be open source.
Even though Matomo is the most popular open-source analytics tool and has been around the longest, it's still only used on 1.4% of the top one million web sites, roughly 2% of GA's market share — it's hard for even well-known open-source software to compete with the $600-billion gorilla.
Like GA, Matomo provides a summary dashboard with a few basic numbers and charts, as well as many detailed reports, including location maps, referral information, and so on. Additionally, Matomo has a feature called "content tracking" that allows automatically tracking users' interactions with the content (clicks and impressions) without writing code, unlike GA, which requires writing JavaScript or installing a third-party plugin. The self-hosted version of Matomo has all of these features, but site owners can also pay for and install various plugins such as funnel measurement, single-sign-on support, and even a rather invasive plugin that records full user sessions including mouse movements.
Matomo is written in PHP and uses MySQL as its data store; installation is straightforward by simply copying the files to a web server with PHP and MySQL installed. It's licensed under the GPLv3; it supports self-hosting for free (standalone or as a WordPress plugin), two relatively low-cost cloud options, and enterprise pricing. Matomo seems like a well-run project and has a fairly active community support forum; it also provides business-level support plans for companies using the self-hosted version.
Open Web Analytics
A similar but less popular tool is Open Web Analytics (OWA), which is also written in PHP and licensed under the GPLv2. OWA uses a donation-based development model rather than having monthly pricing options for a hosted service. Of all the open-source tools, OWA is the one that feels most like a clone of Google Analytics; even its dashboard looks similar to GA's — so it may be a good option for users who are familiar with GA's interface.
OWA is not as feature-rich as Matomo, but it still has all the basics: an overview dashboard, web statistics, visitor locations on a map overlay, and referrer tracking. Like Matomo, it comes with a WordPress integration to analyze visitors on those types of sites. It also provides various ways to extend the built-in functionality, including an API, the ability to add new "modules" (plugins), and the ability to hook into various tracking events (e.g. a visitor's first page view or a user's mouse clicks).
OWA is maintained by a single developer, Peter Adams, and has had periods of significant inactivity. Recently, development seems to have picked up, with Adams shipping several new releases in early 2020. Some of the warnings on recent releases, such as those for the 1.6.9 release, may be a bit worrisome, however ("! IMPORTANT: The API endpoint has changed!"). Installation is again straightforward, and just requires copying the PHP files to a web server and having a MySQL database available.
Countly
Another open-core option, Countly, was founded in 2013; it is relatively feature-rich and has many dashboard types. Of the tools we are covering, though, it is the one that feels the most like a "web startup", complete with a polished video on its home page and sleek dashboards in its UI. Countly advertises that it is "committed to giving you control of your analytics data".
Countly has a clear distinction between its enterprise edition (relatively expensive, starting at $5000 annually) and its self-hosted community edition, with the latter limited to "basic Countly plugins" and "aggregated data". Countly's core source code is licensed under the GNU AGPL, with the server written using Node.js (JavaScript), and SDKs for Android and iOS written in Java and Objective C.
Countly's basic plugins provide typical analytics metrics such as simple statistics and referrers for web and mobile devices, but also some more advanced features like scheduling email-based reports and recording JavaScript and mobile app crashes. However, its enterprise edition brings in a wide range of plugins (made either by Countly or by third-party developers) that provide advanced features such as HTTP performance monitoring, funnels with goals and completion rates, A/B testing, and so on. Overall, Countly's community edition is a reasonably rich offering for companies with mobile apps or that are selling products online, and it provides the option to upgrade to the enterprise version later if more is needed.
Snowplow
A more generalized event-analytics system is Snowplow Analytics, founded in 2012 and marketed as "the enterprise-grade event data collection platform". Snowplow provides the data-collection part of the equation, but it is up to the installer to determine how to model and display the data. It is useful for larger companies who want control over how they model sessions or that want to enrich the data with business-specific fields.
Setting up an installation of Snowplow is definitely not for the faint of heart; it requires configuring the various components, along with significant Amazon Web Services (AWS) setup, and it may be possible, but not easy, to install it outside of AWS. There is, however, a comprehensive AWS setup guide on the GitHub wiki (and the company does offer for-pay hosted options). Companies can set Snowplow up to insert events into PostgreSQL or AWS's columnar Redshift database, or to leave the data in Amazon S3 for further processing. Typically a business-intelligence tool like Looker or ChartIO is used to view the data, but Snowplow does not prescribe that aspect.
Snowplow is a collection of tools written in a number of languages, notably Scala (via Spark) and Ruby. It is available under the Apache 2.0 license. Snowplow is used by almost 3% of the top 10,000 web sites, so it may be a reasonable option for larger companies that want full control over their data pipeline.
Analytics using web access logs
All of the systems described above use JavaScript-based tracking; the benefit of that approach is that it provides richer information (for example, screen resolution) and doesn't require access to web logs. However, if server-access logs are available, it may be preferable to feed those logs directly into analysis software. There are a number of open-source tools that do this; three that have been around for over 20 years are AWStats, Analog, and Webalizer. AWStats is written in Perl and is the most full-featured and actively maintained of the bunch; Analog and Webalizer are both written in C, but neither is actively maintained.
A more recent contender is the MIT-licensed GoAccess, which was designed first as a terminal-based log analyzer, but also has a nice looking HTML view. GoAccess is written in C with only an ncurses dependency, and supports all of the common access-log formats, including those from cloud services such as Amazon S3 and Cloudfront.
GoAccess is definitely the most modern-looking and well-maintained access-log tool, and it generates all of the basic metrics: hit and visitor count by page URL, breakdowns by operating system and browser type, referring sites and URLs, and so on. It also has several metrics that aren't typically included in JavaScript-based tools, for example page-not-found URLs, HTTP status codes, and server response time.
GoAccess's default mode outputs a static report, but it also has an option that updates the data in real time: it updates every 200 milliseconds in terminal mode, or every second in HTML mode (using its own little WebSocket server). GoAccess's design seems well thought-out, with options for incremental log parsing (using data structures stored to disk) and support for parsing large log files using fast parsing code and in-memory hash tables.
The tool is easy to install on most systems, with pre-built packages for all the major Linux package managers and a Homebrew version for macOS users. It even works on Windows using Cygwin or through the Windows Subsystem for Linux on Windows 10.
Wrapping up
All in all, there are several good options for those who need more powerful analytics, or a system similar to GA, but want it to be open source. For those running e-commerce sites, or in need of features like funnel analysis, Matomo and Countly seem like good choices. Enterprises that need direct control over how their events are stored and modeled should perhaps consider a Snowplow installation. For those who have access to their web logs, or just don't want to use JavaScript-based tracking, GoAccess seems like a good choice for web-log analysis in 2020.
Page editor: Jonathan Corbet
