LWN.net Weekly Edition for April 24, 2014

The next generation of Python programmers

By Jake Edge
April 23, 2014
PyCon 2014

In a keynote on day two of PyCon 2014 (April 12), Jessica McKellar made an impassioned plea for the Python community to focus on the "next generation" of Python programmers. She outlined the programming-education problem that exists in US high schools (and likely elsewhere in the world as well), but she also highlighted some concrete steps the community could take to help fix it. She expressed hope that she could report progress at the next PyCon, which will also be held in Montréal, Canada, next year.

[Jessica McKellar]

Statistics

McKellar used the US state of Tennessee (known for "country music and Jack Daniels") as an example. That is where she went to high school and where her family lives. There are roughly 285,000 high school students in Tennessee, but only 251 of them took the advanced placement (AP) computer science (CS) exam. That is 0.09%. AP CS is the most common computer science class that high school students in the US can take.

She showed a slide with an Hour of Code promo that had various "important, famous, and rich people", including US President Barack Obama, who are enthusiastic about students learning to code, she said. They believe that all students should at least have the opportunity to learn how to program in school.

But the reality is quite a bit different. In terms of AP participation by students, CS is quite low. She put up a graph of AP test takers by subject for 2013, showing CS with roughly 30,000 takers nationwide, slightly ahead of 2D studio art and well below subjects like history or literature, which have 400,000+ participants each.

The problem, typically, is the availability of classes. McKellar's sister goes to high school in Nashville, Tennessee. She likes electronics and would be interested in taking a class in computer programming. The best that she is offered at her high school is a class on learning to use Microsoft Word, however. That's not just a problem for Tennessee, either, as there are roughly 25,000 high schools in the US and only around 2,300 of them teach AP CS.

She put up more statistics, including the pass rate of the AP CS exam, which was 66% in Tennessee. But of the 25 African-American students who took it, the pass rate was 32%. She showed maps of the US with states highlighted where no African-Americans took the test (11 states), no Hispanics (7), and no girls (3). One of the latter was Mississippi, where the lack of female test takers may be somewhat self-reinforcing; any girl in that state may well get the message that AP CS is not something that girls do. In addition, any girl who considers taking it will have to worry about her results being scrutinized on a national level: "I better pass or I will be a stat people talk about".

AP class participation by gender was next up. Girls outnumber boys in more AP classes than the reverse, but for math and science that balance flips. CS has the worst gender imbalance of any AP class.

The Python community cares about this because it spends not just time and money, but "blood, sweat, and actual tears" trying to improve this imbalance, which starts in high school and even earlier. McKellar said she understands "how empowering programming is" and she wants students to have the opportunity to "engage in computational thinking skills". She wants the politicians that are making decisions about how the country is run to have that experience as well. Fixing this problem is important to the long-term success of the Python language as well.

What can be done

It was a "depressing introduction", she said, with lots of statistics that make it look like an "intractable problem". But there is some solace that can be taken from some of those statistics. While AP CS is the most gender-skewed AP exam, 29% of the test takers in Tennessee were girls. That is largely due to one teacher, Jill Pala in Chattanooga, who taught 30 of the 71 girls who took the exam. McKellar asked: If one teacher can make that big of a difference, what can a 200,000-member community do?

To answer that question, she asked three CS education specialists. If one is "armed with a Python community", what can be done to help educate the next generation of (Python) programmers? She got responses in four main areas: policy, student engagement, supporting teachers, and curriculum development. She said that she would be presenting the full fire hose of ideas in the hopes that everyone in the audience would find at least one that resonated.

Policy

To start with, in 35 states computer science is only an elective that doesn't give any math or science credit to a student who takes it. If a class doesn't "count for anything", students don't want to take it, schools don't want to offer it, and teachers don't want to teach it. One thing that audience members could do is to "pick up the phone" and ask legislators and school boards to change that.

There is also no per-state data on who decides what the high school graduation requirements are, nor is there a single place to find the per-state credential requirements for a teacher to be certified in CS. This is a problem for policy makers because they have no way to judge their own state policies by comparing them with their neighbors. It is something that could be fixed "in a weekend" with some Python, web scraping, and version control, she said.

Another problem is that AP CS is still taught in Java, which is a bad language to teach in. That is not her "hating on Java", but it is something that teachers say. If you think about what goes into writing "hello world" for Java, it doesn't allow deferring certain concepts (e.g. classes, object-oriented programming), which makes it difficult to understand "if you've never written a for loop before". Java is also getting "long in the tooth" as the AP CS language. Pascal was that language for fifteen years, followed by C++ for six years, and now Java for the last eleven years.

People have been gathering information about what languages colleges use, what languages college teachers wish they were using, and what language they think they will be using ten years from now. Some of that information is compiled into Reid's List (described in this article [PDF], scroll down a ways), which shows that Python's popularity with college CS programs is clearly on the rise. But Reid's List has not been updated since 2012, as it is done manually, partly via phone calls, she said.

The 2012 list also shows Java with a clear lock on first place for the first programming language taught (Java 197, C++ 82, Python 43, ...). But McKellar believes that Python's numbers have "skyrocketed" since then. She would like people to engage the College Board (which sets the AP standards) to switch the AP CS exam from Java to Python. The College Board bases its decision on what language teachers want to teach in, so the rise of Python could be instrumental in helping it to make that decision, especially if that rise has continued since 2012. AP CS courses in Python would make for a "more fun computing experience", she said, and by engaging with the College Board, that change could happen in four to six years.

Student engagement

Students don't really know what CS is or what it is about, so they don't have much interest in taking it. But there are lots of existing organizations that already work with kids and simply lack programming expertise: summer camps, Boy Scouts, Girl Scouts, etc. This is an opportunity for the Python community to work with these groups to add a programming component to their existing activities.

There are also after-school programs that lack programming teachers. The idea is to take advantage of existing programs for engaging students to help those students learn a bit about computer science. That may make them more likely to participate in AP CS when they get to high school.

Supporting teachers

CS teachers are typically all alone, as there is generally only one per high school. That means they don't have anyone to talk to or to bounce ideas off of. But the Python community is huge, McKellar said, so it makes sense to bring those high school teachers into it.

Pythonistas could offer to answer lesson plan questions. Or offer to be a teaching assistant. They could also volunteer to visit the class to answer questions about programming. Inviting the teacher to a local user group meeting might be another way to bring them into the community.

Curriculum development

There is a new AP CS class called "CS Principles" that is being developed and tested right now. It covers a broader range of topics that will appeal to more students, so it is a "really exciting way to increase engagement", she said. So far, though, there is no day-to-day curriculum for the course in any language. That is a huge opportunity for Python.

If the best way to teach the CS Principles class was with a curriculum using Python, everyone would use it, she said. Teachers have a limited amount of time, so if there is an off-the-shelf curriculum that does a good job teaching the subject, most will just use it. The types of lessons that are laid out for the class look interesting (e.g. cryptography, data as art, sorting) and just require some eyes and hands to turn them into something Python-oriented that can be used in the classroom. Something like that could be used internationally, too, as there aren't many curricula available for teaching programming to high school students.

Deploying Python for high schools can be a challenge, however. She talked with one student who noted that even the teachers could not install software in the computer lab—and the USB ports had been filled with glue for good measure. That means that everything must be done in the browser. Runestone Interactive has turned her favorite book for learning Python, Think Python, into an interactive web-based textbook. The code is available on GitHub.

Perhaps the most famous browser-based Python is Skulpt, which is an implementation of the language in JavaScript (also available on GitHub). There are currently lots of open bugs for things that teachers want Skulpt to be able to do. Fixing those bugs might be something the community could do. Whether we like it or not, she said, the future of teaching programming may be in the browser.

Summing up

Since we are starting from such terrible numbers (both raw and percentage-wise), a small effort can make a big difference, McKellar said. The Python Software Foundation (PSF), where McKellar is a director, is ready to help. If you are interested in fixing Skulpt bugs, for example, the PSF will feed you pizza while you do that (in a sprint, say). It will also look to fund grant proposals for any kind of Python-education-related project.

She put forth a challenge: by next year's PyCon, she would like to see every attendee do one thing to further the cause of the next generation of Python programmers. At that point, she said, we can look at the statistics again and see what progress has been made. As she made plain, there is plenty of opportunity out there; it just remains to be seen if the community picks up the ball and runs with it.

Slides and video from McKellar's keynote are available.


Pickles are for delis

By Jake Edge
April 23, 2014
PyCon 2014

Alex Gaynor likes pickles, but not of the Python variety. He spoke at PyCon 2014 in Montréal, Canada to explain the problems he sees with the Python pickle object serialization mechanism. He demonstrated some of the things that can happen with pickles—long-lived pickles in particular—and pointed out some alternatives.

[Alex Gaynor]

Pickle introduction

He began by noting that he is a fan of delis, pickles, and, sometimes, software, but that some of those things (software and the Python pickle module) were also among his least favorite things. The idea behind pickle serialization is simple enough: hand the dumps() function an object, get back a byte string. That byte string can then be handed to the pickle module's loads() function at a later time to recreate the object. Two of the use cases for pickles are to send objects between two Python processes or to store arbitrary Python objects in a database.

The pickle.dumps() (dump to a string) function returns "random nonsense", he said, and demonstrated that with the following code:

    >>> pickle.dumps([1, "a", None])
    "(lp0\nI1\naS'a'\np1\naNa."

By using the pickletools module, which is not well-known, he said, one can peer inside the nonsense:

    >>> pickletools.dis("(lp0\nI1\naS'a'\np1\naNa.")
        0: (    MARK
        1: l        LIST       (MARK at 0)
        2: p    PUT        0
        5: I    INT        1
        8: a    APPEND
        9: S    STRING     'a'
       14: p    PUT        1
       17: a    APPEND
       18: N    NONE
       19: a    APPEND
       20: .    STOP

The pickle format is a simple stack-based language, similar in some ways to the bytecode used by the Python interpreter. The pickle is just a list of instructions to build up the object, followed by a STOP opcode to return what it has built so far.
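
The examples in the talk were run under Python 2. As a rough sketch of the same experiment under Python 3 (an assumption on our part, not something shown in the talk), the code below pickles the same list with protocol 0, the text-based format seen above, and uses pickletools.genops() to list the opcode names. The variable names are illustrative, and the exact opcodes can differ between Python versions; strings, for instance, are pickled as Unicode under Python 3.

```python
import pickle
import pickletools

# Protocol 0 is the text-based format used in the article's examples.
data = pickle.dumps([1, "a", None], protocol=0)

# genops() yields an (opcode, argument, position) triple per instruction.
opcode_names = [op.name for op, arg, pos in pickletools.genops(data)]

# loads() runs the instructions and returns the rebuilt object.
restored = pickle.loads(data)

print(opcode_names)
print(restored)
```

The instruction stream should still begin by creating a list and end with STOP, with one APPEND per element in between.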

In principle, dumping data to the pickle format is straightforward: determine the object's type, find the dump function for that type, and call it. Each of the built-in types (like list, int, or string) would have a function that can produce the pickle format for that type.
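
To make the "find the dump function for that type, and call it" idea concrete, here is a toy sketch. It is not how the pickle module is implemented: the DISPATCH table, the toy_dump() and toy_dumps() helpers, and the tuple-based instruction format are all invented for illustration, loosely borrowing the opcode names from the disassembly above.

```python
# Toy illustration only; the real pickle module works differently.

def dump_int(obj, out):
    out.append(("INT", obj))

def dump_str(obj, out):
    out.append(("STRING", obj))

def dump_none(obj, out):
    out.append(("NONE", None))

def dump_list(obj, out):
    # Create an empty list, then dump and append each element in turn.
    out.append(("LIST", None))
    for item in obj:
        toy_dump(item, out)
        out.append(("APPEND", None))

# One dump function per supported built-in type.
DISPATCH = {
    int: dump_int,
    str: dump_str,
    type(None): dump_none,
    list: dump_list,
}

def toy_dump(obj, out):
    # Look up the dump function by the object's type and call it.
    # Unsupported types raise KeyError in this sketch.
    DISPATCH[type(obj)](obj, out)
    return out

def toy_dumps(obj):
    return toy_dump(obj, []) + [("STOP", None)]

instructions = toy_dumps([1, "a", None])
print(instructions)
```

A table keyed on type only works for types known in advance, which is why user-defined classes need a different mechanism.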

But, what happens for user-defined objects? Pickle maintains a table of functions for the built-in types, but it can't do that for user-defined classes. It turns out that it uses the __reduce__() member function that returns a function and arguments used to recreate the object. That function and its arguments are put into the pickle, so that the function can be called (with those arguments) at unpickling time. Using the built-in object() type, he showed how that information is stored in the pickle (the output was edited by Gaynor for brevity):

    >>> pickletools.dis(pickle.dumps(object()))
        0: c    GLOBAL     'copy_reg _reconstructor'
       29: c        GLOBAL     '__builtin__ object'
       55: N        NONE
       56: t        TUPLE
       60: R    REDUCE
       64: .    STOP

The _reconstructor() function from the copy_reg module is called to rebuild the object; its argument is the object type from the __builtin__ module. Similarly, for a user-defined class (again, output has been simplified):

    >>> class X(object):
    ...  def __init__(self):
    ...   self.my_cool_attr = 3
    ...
    >>> x = X()
    >>> pickletools.dis(pickle.dumps(x))
        0: c    GLOBAL     'copy_reg _reconstructor'
       29: c        GLOBAL     '__main__ X'
       44: c        GLOBAL     '__builtin__ object'
       67: N        NONE
       68: t        TUPLE
       72: R    REDUCE
       77: d        DICT
       81: S    STRING     'my_cool_attr'
      100: I    INT        3
      103: s    SETITEM
      104: b    BUILD
      105: .    STOP

The pickle refers to the class X, by name, as well as the my_cool_attr attribute. By default, Python pickles all of the entries in x.__dict__, which stores the attributes of the object.

A class can define its own unique pickling behavior by defining the __reduce__() method. If it contains something that cannot be pickled (a file object, for example), some kind of custom pickling solution must be used. __reduce__() needs to return a function and arguments to be called at unpickling time, for example:

    >>> class FunkyPickle(object):
    ...  def __reduce__(self):
    ...   return (str, ('abc',),)
    ...
    >>> pickle.loads(pickle.dumps(FunkyPickle()))
    'abc'

Unpickling is "shockingly simple", Gaynor said. If we look at the first example again (i.e. [1, 'a', None]), the commands in the pickle are pretty straightforward (ignoring some of the extraneous bits). LIST creates an empty list on the stack, INT 1 puts the integer 1 on the stack, and APPEND appends it to the list. The string 'a' and None are handled similarly.
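
The walk-through above can be mimicked with a toy stack machine. This is only a sketch of the idea, not the pickle module's actual loader: the toy_load() function and the (opcode, argument) tuple format are invented for illustration, and only the handful of opcodes from the first example are handled.

```python
def toy_load(instructions):
    """Run a list of (opcode, argument) pairs on a simple stack."""
    stack = []
    for opcode, arg in instructions:
        if opcode == "LIST":
            stack.append([])            # push a new, empty list
        elif opcode in ("INT", "STRING"):
            stack.append(arg)           # push the literal value
        elif opcode == "NONE":
            stack.append(None)
        elif opcode == "APPEND":
            item = stack.pop()          # pop the value...
            stack[-1].append(item)      # ...and append it to the list below
        elif opcode == "STOP":
            return stack.pop()          # whatever was built is the result
        else:
            raise ValueError("unknown opcode: %r" % (opcode,))
    raise ValueError("instruction stream ended without STOP")

program = [
    ("LIST", None),
    ("INT", 1), ("APPEND", None),
    ("STRING", "a"), ("APPEND", None),
    ("NONE", None), ("APPEND", None),
    ("STOP", None),
]
result = toy_load(program)
print(result)
```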

Pickle woes

But, as we've seen, pickles can cause calls to any function available to the program (built-ins, imported modules, or those present in the code). Using that, a crafted pickle can cause all kinds of problems—from information disclosure to a complete compromise of the user account that is unpickling the crafted data. It is not a purely theoretical problem, either, as several applications or frameworks have been compromised because they unpickled user-supplied data. "You cannot safely unpickle data that you do not trust", he said, pointing to a blog post that shows how to exploit unpickling.
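
A harmless illustration of that point, written for Python 3: the byte string below is typed by hand rather than produced by pickle.dumps(), and it contains no class at all, yet loading it calls a function. The choice of the built-in len() is ours, picked because it has no side effects; it is not an example from the talk.

```python
import pickle

# Hand-written protocol-0 pickle; it was not produced by pickle.dumps().
#   c builtins len   GLOBAL: push the built-in len() function
#   (                MARK
#   S'abc'           STRING: push 'abc'
#   t                TUPLE:  collect everything since MARK into ('abc',)
#   R                REDUCE: call len('abc') and push the result
#   .                STOP
crafted = b"cbuiltins\nlen\n(S'abc'\ntR."

# Loading the data is enough to trigger the call.
result = pickle.loads(crafted)
print(result)
```

Whoever controls the bytes controls which importable callable runs and with what arguments, which is why untrusted data should never reach pickle.loads().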

But, if the data is trusted, perhaps because we are storing and retrieving it from our own database, are there other problems with pickle? He put up a quote from the E programming language web site (scroll down a ways) that pointed to the problem:

Do you, Programmer, take this Object to be part of the persistent state of your application, to have and to hold, through maintenance and iterations, for past and future versions, as long as the application shall live?

- Erm, can I get back to you on that?

He then related a story that happened "many, many maintenance iterations ago, in a code base far, far away". Someone put a pickle into a database, then no one touched it for eighteen months or so. He needed to migrate the table to a different format in order to optimize the storage of some of the other fields. About halfway through the migration of this 1.6 million row table, he got an obscure exception: "module has no attribute".

As he mentioned earlier, pickle stores the name of the pickled class. What if that class no longer exists in the application? In that case, Python raises an AttributeError exception, because the "'module' object has no attribute 'X'" (where X is the name of the class). In Gaynor's case, he was able to go back into the Git repository, find the class in question, and add it back into the code.
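
The failure is easy to reproduce in miniature under Python 3. In the sketch below, a hand-written pickle stands in for the old database row; it names RemovedClass in the json module, a hypothetical class chosen only because it does not exist. The exact wording of the error message varies between Python versions.

```python
import pickle

# Hand-written protocol-0 pickle standing in for a stale stored pickle.
# It refers to json.RemovedClass, a hypothetical name that does not exist.
stale = b"cjson\nRemovedClass\n."

try:
    pickle.loads(stale)
    error = None
except AttributeError as exc:
    # The lookup by name fails only now, at unpickling time.
    error = exc

print(type(error).__name__, error)
```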

A similar problem can occur if the name of an attribute in the class should change. The name is "hardcoded" in the pickle itself. In both of his examples, someone was doing maintenance on the code, made some seemingly innocuous changes, but didn't realize that there was still a reference to the old names in a stored pickle somewhere. In Gaynor's mind, this is the worst problem with pickles.

Alternatives

But if pickling is not a good way to serialize Python objects, what is the alternative? He said that he advocated writing your own dump() functions for objects that need them. He demonstrated a class that had a single size attribute, along with a JSON representation that was returned from dump():

    def dump(self):
        return json.dumps({
            "version": 1,
            "size": self.size
        })

The version field is the key to making it all work as maintenance proceeds. If, at some point, size is changed to width and height, the dump() function can be changed to emit "version": 2. One can then create a load() function that deals with both versions. It can derive the new width and height attributes from size (perhaps using sqrt() if size was the area of a square table as in his example).
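
A minimal sketch of how that might look after the change to width and height follows. The Table class name, the load() classmethod, and the error handling are assumptions made for illustration; only the version and size fields and the sqrt() idea come from the talk.

```python
import json
import math


class Table(object):
    """Hypothetical table class; version 2 stores width and height."""

    def __init__(self, width, height):
        self.width = width
        self.height = height

    def dump(self):
        # New data is always written in the latest format.
        return json.dumps({
            "version": 2,
            "width": self.width,
            "height": self.height,
        })

    @classmethod
    def load(cls, data):
        fields = json.loads(data)
        version = fields["version"]
        if version == 1:
            # Version 1 stored only the area of a square table.
            side = math.sqrt(fields["size"])
            return cls(side, side)
        if version == 2:
            return cls(fields["width"], fields["height"])
        raise ValueError("unsupported version: %r" % (version,))


# Data written by the old code still loads...
old = Table.load('{"version": 1, "size": 16}')
# ...and so does data written by the new code.
new = Table.load(Table(3, 5).dump())
print(old.width, old.height, new.width, new.height)
```

Each old format gets an explicit branch that can be unit-tested with a literal string, and an unknown version fails loudly instead of producing a half-initialized object.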

Writing your own dump() and load() functions is more testable, simpler, and more auditable, Gaynor said. It can be tested more easily because the serialization doesn't take place inside an opaque framework. The simplicity also comes from the fact that the code is completely under your control; pickle gives you all the tools needed to handle these maintenance problems (using __reduce__() and a few other special methods), but it takes a lot more code to do so. Custom serialization is more auditable because one must write dump() and load() for each class that will be dumped, rather than pickle's approach which simply serializes everything, recursively. If some attribute got pickled improperly, it won't be known until the pickle.load() operation fails somewhere down the road.

His example used JSON, but there are other mechanisms that could be used. JSON has an advantage that it is readable and supported by essentially every language out there. If speed is an issue, though, MessagePack is probably worth a look. It is a binary format and supports lots of languages, though perhaps somewhat fewer than JSON.

He concluded his talk by saying that "pickle is unsafe at any speed" due to the security issues, but, more importantly, the maintenance issues. Pickles are still great at delis, however.

An audience member wondered about using pickles for sessions, which is common in the Python web-framework world. Gaynor acknowledged pickle's attraction, saying that being able to toss any object into the session and get it back later is convenient, but it is also the biggest source of maintenance problems in his experience. The cookies that are used as session keys (or signed cookies that actually contain the pickle data) can pop up at any time, often many months later, after the code has changed. He recommends either only putting simple types that JSON can handle directly into sessions or creating custom dump() and load() functions for things JSON can't handle.

There are ways to make pickle handle code updates cleanly, but they require that lots of code be written to do so. Pickle is "unsafe by default" and it makes more sense to write your own code rather than to try to make pickle safe, he said. One thing that JSON does not handle, but pickle does, is cyclic references. Gaynor believes that YAML does handle cyclic references, but he cautioned, without elaborating, that the safe_load() function in the most popular Python implementation must be used rather than the basic load() function. Cyclic references are one area that makes pickle look nice, he said.
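
The difference in cyclic-reference handling can be seen with a list that contains itself. This is a small sketch using only the standard library's json and pickle modules under Python 3; the variable names are illustrative.

```python
import json
import pickle

cycle = []
cycle.append(cycle)  # the list now contains itself

# json detects the cycle and refuses to serialize it.
try:
    json.dumps(cycle)
    json_error = None
except ValueError as exc:
    json_error = exc

# pickle records the self-reference and rebuilds it on load.
copy = pickle.loads(pickle.dumps(cycle))

print(type(json_error).__name__)
print(copy[0] is copy)
```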

One of the biggest lessons he has learned when looking at serialization is that there is no single serialization mechanism that is good for all use cases. Pickle may be a reasonable choice for multi-processing programs where processes are sending pickles to each other. In that case, the programmer controls both ends of the conversation and classes are not usually going to disappear during the lifetime of those processes. But the clear sense was that, even in that case, Gaynor would look at a solution other than pickle.

The video of the talk is at pyvideo.org (along with many other PyCon videos). The slides are available at Speaker Deck.


Transmageddon media transcoder reaches 1.0

By Nathan Willis
April 23, 2014

Version 1.0 of the Transmageddon media transcoder was released in early April, complete with several new features in addition to the implicit stability cred that accompanies a 1.0 release. The tool itself looks unassuming: it occupies a small GTK+ window with little on display beyond a few pull-down menu selectors and a "Transcode" button, but it fills an important gap in the free software landscape. Support for compressed media formats can often turn into a minefield, with patent and licensing problems on one side and lack of free-software support on the other. Tools like Transmageddon that convert one format to another are in some sense a necessary evil as long as there are multiple audio/video formats to choose from, but a good free software tool can minimize the pain, making format conversion easy to tackle without constructing the lengthy and arcane command sequences demanded by command-line encoders.

[Transmageddon GUI]

The new release was made April 1, and was followed up by a 1.1 release on April 10 that fixed a pair of important bugs. Source bundles are available for download on the project's web site, as are RPM packages. Transmageddon requires GTK+3, GLib 2, Python 3, and GStreamer 1.0. The formats that the application can handle depend on which GStreamer plugins are installed; for the fullest range of options, all of the plugin collections are recommended, including the Bad and Ugly sets.

Christian Schaller started Transmageddon as a kind of personal experiment to familiarize himself with Python and GStreamer development back in 2009. Since then, the application has grown a bit in scope, but mostly in accordance with new features supported by GStreamer. The original intent, for instance, was to support simple output profiles that targeted portable devices, so that a user could transcode a video file for playback on the Nokia N810 tablet or Apple iPod. As most people who have experimented with transcoding videos have discovered, though, there are still quite a few options to juggle for any one target device, including separate options for the video and audio codecs chosen and what to do with multiple audio tracks. In addition to that, of course, the landscape of codecs and container formats has continued to evolve over the intervening years, as it will certainly continue to do.

Transmageddon's main competition in the easy-to-use media transcoder space is arguably HandBrake, which also offers a user-friendly GUI and is available under the GPL. But HandBrake has its own oddities that inhibit its support from Linux distributions, starting with the fact that it ships linked against GPL-incompatible components like the versions of FAAC that include proprietary code.

In practice, the other main transcoding approach in use is directly calling the command-line transcoding features exposed by FFmpeg, Mencoder, and other media-playback libraries. The difficulty, of course, is that ease of use is severely lacking with this approach. One often needs a different incantation for almost every input and output option, and the command-line libraries have a notorious reputation for esoteric syntax that is difficult to memorize—a reputation that, frankly, is well-earned. Command-line usage by end users is not the main goal of any of these library projects.

Transmageddon, on the other hand, works—and it generally works without too many confusing errors that are hard to debug. The user selects the input file from a GTK+ file-chooser dialog, Transmageddon automatically detects the makeup and format of the audio and video data within, and presents drop-down lists with the output options supported by the available GStreamer plugins.

[Transmageddon and audio options]

The 1.0 release introduced three major new features: support for multiple audio streams, support for directly ripping DVDs, and support for defining custom transcoding presets. The multiple audio stream support means that Transmageddon can transcode each audio stream independently; it can omit some streams entirely or transcode each stream into a different codec. DVDs commonly include multiple audio tracks these days, and for some (such as audio commentary tracks) it might make sense to use a speech-specific codec like Speex, rather than a general-purpose codec that must handle the blend of speech, music, and sound effects found in the primary audio mix.

[Transmageddon DVD ripping]

DVD-ripping, of course, is a common use case that Transmageddon had simply never supported before. The legality of ripping DVDs—even DVDs that you own—is a matter where the entertainment industry vehemently clashes with the software industry (and many consumers), so manually installing a few optional libraries like libdvdcss2 is required. Some Linux distributions make this step easier than others, but for a readable DVD, Transmageddon accurately detects the disc's media files, and transcoding them to a more portable format is not any more difficult than transcoding other content.

Custom presets for Transmageddon can be created by making an XML configuration file that follows the application's profile syntax. The syntax is straightforward, with <video> and <audio> elements encapsulating basic information like the codec to be used, frame rate, and number of audio channels. The only genuinely complex part of defining a custom profile is likely to be knowing and understanding the options for the chosen codecs; for help with that task, the Transmageddon site recommends getting familiar with GStreamer's gst-inspect-1.0 utility.

Two other low-level features also made it into the 1.0 release: support for Google's VP9 video codec and the ability to set a "language" tag for the input file (which is then preserved in the metadata of the output file). VP9 is now supported by Firefox, Chrome/Chromium, FFmpeg, and a few other applications, although it might not yet be called widespread. The language-tagging feature is just the beginning of better language support: DVDs typically tag their audio tracks with a language (which is metadata that Transmageddon preserves in the output file); in the future, Schaller said, he hopes to support multiple language tags on all input files, to simplify finding an audio track by language during playback.

Given its ease of use, Transmageddon is likely to attract quite a few users. Almost everyone runs into the need to convert a media file into another format at one time or another; some people do such transcoding on a regular basis, particularly for their portable devices. But Transmageddon is also worth exploring because it is built on GStreamer, which in recent years has grown into a library in widespread use by projects far beyond the GNOME desktop environment. GStreamer's plugin architecture makes it possible to build and use a multimedia system without any problematic codecs—and to build it without the hassle of recompiling from source. Distributors and system builders can address codec patent licenses and other such issues as they see fit, often on a per-plugin basis.

Transmageddon thus meets an important need with a simple and straightforward interface. At the same time, though, its existence also demonstrates that despite the popularity of GStreamer, FFmpeg, MPlayer, and similar big-name multimedia projects, transcoding media files remains an end-user's problem. Life for the average user would be much simpler if media applications like MythTV, XBMC, gPodder, and the various "music store" plugins of audio players automatically handled transcoding duties. But they don't. Partly that is because the never-ending codec wars make it unduly hard for a free-software application to automatically read and convert proprietary file formats, of course, but the upshot is that users have to juggle the codecs, containers, and transcoding themselves.


Page editor: Jonathan Corbet


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds