Microformats turn 9 years old
At his blog, Tantek Çelik writes about the ninth birthday of the microformats effort, which seeks to express semantic information in web pages through the use of attribute names within HTML elements, in contrast to comparatively "heavyweight" schemes like RDFa and Microdata. Çelik notes that the community-driven process of microformats' development seems to have enabled its survival. "Looking back nine years ago, none of the other alternatives promoted in the 2000s (even by big companies like Google and Yahoo) survive to this day in any meaningful way," he says. "Large companies tend to promote more complex solutions, perhaps because they can afford the staff, time, and other resources to develop and support complex solutions. Such approaches fundamentally lack empathy for independent developers and designers, who don't have time to keep up with all the complexity." In addition to his analysis of the past nine years (including an exploration of the downside of email-based discussions), Çelik takes the occasion to announce that microformats2 has now been upgraded to the status of ready-to-use recommendation, and points site maintainers to tools to support the transition.
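In practice, a microformat is just a conventional set of class names. As a rough illustration (the page snippet below is invented, though `h-card`, `p-name`, and `u-url` are real microformats2 class names), a few lines of Python using only the standard library's `html.parser` can pull structured data out of such markup:

```python
# Minimal sketch of reading a microformats2 h-card with only the
# standard library; the page snippet is invented for illustration.
from html.parser import HTMLParser

PAGE = """
<div class="h-card">
  <span class="p-name">Tantek Çelik</span>
  <a class="u-url" href="http://tantek.com/">home page</a>
</div>
"""

class HCardParser(HTMLParser):
    """Collects properties found inside an h-card element."""
    def __init__(self):
        super().__init__()
        self.in_card = False
        self.current_prop = None
        self.props = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = attrs.get("class", "").split()
        if "h-card" in classes:
            self.in_card = True
        if self.in_card:
            for c in classes:
                if c.startswith("p-"):
                    # p-* properties take their value from element text
                    self.current_prop = c[2:]
            if "u-url" in classes:
                # u-* properties take their value from the URL attribute
                self.props["url"] = attrs.get("href")

    def handle_data(self, data):
        if self.current_prop:
            self.props[self.current_prop] = data.strip()
            self.current_prop = None

parser = HCardParser()
parser.feed(PAGE)
print(parser.props)  # → {'name': 'Tantek Çelik', 'url': 'http://tantek.com/'}
```

A real parser would handle nesting, multiple values, and the other prefixes (`dt-*`, `e-*`), but the point stands: the semantics ride along in ordinary HTML attributes.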
Posted Jun 21, 2014 0:09 UTC (Sat)
by kenmoffat (subscriber, #4807)
[Link] (7 responses)
It appears to be something for marketing.
Posted Jun 21, 2014 0:47 UTC (Sat)
by jd (guest, #26381)
[Link] (6 responses)
Tagging articles with keywords so that they can be sorted or selected is one example of how this sort of thing is done in practice.
Or you could use it to produce customizable web pages. You know that a user has to have a certain set of features in a template and that the same feature can't be implemented in two different ways. Not an issue. SPARQL, the semantic web's version of SQL, can give you the options and enforce the constraints.
Maybe you want better cross-referencing in a wiki. Specify in SPARQL how to identify relationships of interest.
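SPARQL itself is its own query language, but the core idea behind it (binding variables against subject/predicate/object triples) can be sketched in a few lines. This is a toy illustration, not SPARQL; the triples, predicate names, and query are invented:

```python
# Toy illustration of the graph-pattern matching at the heart of
# SPARQL; the triples, predicate names, and query are invented.
triples = [
    ("article:1", "tagged", "linux"),
    ("article:2", "tagged", "linux"),
    ("article:2", "tagged", "kernel"),
    ("article:3", "tagged", "gardening"),
]

def match(pattern):
    """Yield variable bindings ('?'-prefixed terms) for each
    triple that fits the pattern."""
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break
        else:
            yield binding

# Roughly equivalent to: SELECT ?a WHERE { ?a :tagged "linux" }
linux_articles = [b["?a"] for b in match(("?a", "tagged", "linux"))]
print(linux_articles)  # → ['article:1', 'article:2']
```

A real SPARQL engine joins multiple patterns and enforces constraints across them, which is what makes the cross-referencing use case above practical.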
I like RDFa, and tools like Protégé for defining content associations, because they have oomph. I dislike microformats because they get really, really ugly when you want to use server-side if/then/else logic. However, they all have value.
And none of it is in marketing. Marketing hates the Semantic Web because it's transparent to users (so there's no metric you can use to prove its value) and allows any external user to add their own value to the site (which puts marketing's jobs at risk).
Posted Jun 21, 2014 4:04 UTC (Sat)
by b7j0c (guest, #27559)
[Link] (5 responses)
go look at sites with high pagerank. none of them use semantic web tech
Posted Jun 21, 2014 8:09 UTC (Sat)
by jd (guest, #26381)
[Link]
Seriously, Associated Press and a few other news distributors mandate the use of microformats to correctly attribute articles. Equally seriously, I don't know of any website that does this. I installed such a system for a newspaper I worked for for a while, but last I looked, they'd ripped out everything I'd worked on.
BBC Sports were supposed to use RDFa, not sure how they got on with it.
Not sure it's a totally dead idea; it's only just become part of the mainstream Apache architecture. Having said that, with no Freshmeat/Freecode any more (boooooo!!!), nobody will know the facility exists. In that case, I guess it is more-or-less dead.
Pity. I liked the idea that dynamic content could generate other dynamic content. It makes a useful "See Also" so much easier to implement.
Someday, on my tombstone, there will be the words "...and he still hadn't finished that SPARQL query he promised me..."
Posted Jun 23, 2014 19:21 UTC (Mon)
by justincormack (subscriber, #70439)
[Link] (3 responses)
Posted Jun 24, 2014 16:28 UTC (Tue)
by b7j0c (guest, #27559)
[Link] (2 responses)
Posted Jul 4, 2014 9:39 UTC (Fri)
by alex (subscriber, #1355)
[Link] (1 responses)
Posted Jul 4, 2014 10:59 UTC (Fri)
by khim (subscriber, #9252)
[Link]
Posted Jun 21, 2014 3:59 UTC (Sat)
by b7j0c (guest, #27559)
[Link] (35 responses)
the initial response to this was the yahoo directory. a human categorized the page, and that human knew if your page was actually about "britney spears y2k sex stock market nasdaq" as the meta tag said, or just another crappy personal homepage.
then google came along and used the notion of in-bound links to fix this problem in a way that was more scalable than yahoo.
the search engines that actually used page metadata all died.
this is why microformats cannot work - as soon as user-defined metadata becomes ubiquitous, it will be gamed, and ultimately ignored.
we have better tools now: two decades of understanding how the web works and much progress in natural language processing.
Posted Jun 21, 2014 6:55 UTC (Sat)
by gmatht (subscriber, #58961)
[Link] (1 responses)
For example, say that the search engine would otherwise have included a website for my search "Linux flower arranging", but since I didn't include "Britney" it decides not to show the site; instead it registers that the site is owed another hit. Sometime later, when another web user searches for Britney and its algorithm would otherwise not consider the site quite relevant enough to be included, it shows them the site and deducts one from the number of hits the site is owed.
Posted Jun 21, 2014 10:01 UTC (Sat)
by roblucid (guest, #48964)
[Link]
Some like lwn.net are deserving of many search hits.
I could imagine the accounting required to be "fair" is changing state, so not only messes with query caching but could be performance problematic due to a distributed update for the hit counter.
Posted Jun 21, 2014 8:14 UTC (Sat)
by tialaramex (subscriber, #21167)
[Link] (31 responses)
I get it that a certain type of person has come to see the web as just a bunch of advertisers passing user eyeballs around for money, but a lot of us are actually trying to get stuff done here. Microformats are one of several ways to _sidestep_ your natural language processing for the case where we want machines to read our content, which in fact is a lot of the time. Personally my stuff uses RDFa, but microformats are a perfectly sensible choice too.
This isn't about writing "My web page is about Britney Spears and I'm 14 and I like ponies and I want to be a hairdresser" so much as "This product has UPC 5060095860304" and "This facility is at lat 3.24140 lon 1.02880". Indeed, for most of us it's not about search engines at all; the article mentions them only briefly, and you seem to have seized on that.
If you'd followed the links you would see that, of course, Google does read microformats. It doesn't trust a microformat tag that says the page is about Coca-Cola any more than it would trust the same assertion in human language, but the tag has the advantage of being unambiguous, which you really seem to have underrated. Tools processing microformats can be simpler AND more reliable than doing NLP everywhere.
Not everything is about advertising or the open, unbounded, untrusted web. In fact, although Google would much prefer nobody thought about it too hard, it looks increasingly as though actually very little valuable economic activity is generated by the advertisements. They're basically a distraction, albeit one that is currently making some people very rich. My supermarket, for example, is paying for highly targeted and superficially effective grocery advertisements. "Effective" because the people the adverts target are buying those groceries - a few pennies on advertising seems to bring in a ~£50 purchase which is a good return, "Superficially" because those are the people who already buy those groceries every week from that retailer, so the advert made no difference to actual sales.
Posted Jun 21, 2014 21:44 UTC (Sat)
by h2 (guest, #27965)
[Link] (24 responses)
You nailed this, glad to see someone here knows the actual business.
He seized on it because ALL methods like this will always be aggressively abused by seo types, particularly in the gray and black hat arenas.
Anyone who thinks otherwise has no experience in the field in my opinion, or is so new to it, or so naive, that they simply aren't aware of the history.
Say you want to identify UPCs, as you mentioned. To you, that's a clear, simple tool with easy-to-predict outcomes. To an SEO, it's an entirely new area to spread into: now they can fill their networks of fake sites with auto-generated articles that contain the relevant microformats, systematically applied. These sites are generally created almost instantly via automation software, at a scale most people simply have no idea of.
So what will happen out here in the real world is that, at a certain point, a statistically relevant proportion of pages using such methods will be known to be spam sites. Then Google, the main target of SEOs due to its market dominance, will release a search algorithm update. It could be rolled into an existing one like Panda, which is related to on-page/on-site content, or released as an entirely new module targeting such sites, with the inevitable friendly-fire casualties where people lose their rankings because of the update.
That's why stuff like this isn't used: anyone with experience in this search-gaming industry knows what is up, and anyone without it might well fall for it. SEO is an ugly, sleazy business, and they are better at it than we are, just as virus authors are better at tricking people than people are at not getting tricked. It's simple: when a field of engineering makes breaking the system the goal rather than building it, it's incredibly easy to crack stuff if you are given the tools to do it.
On the bright side, the 'semantic web', which I always considered a total joke created by well-intentioned but utterly naive pseudo-engineers lacking any real-world web experience (i.e., the reality of SEO spam, Google, and so on), is now hopefully a dead entity.
While I avoided the black-hat side of this business in the work I've done, I am well acquainted with it, because reading the black hats is usually one of the best ways to bone up on new Google algorithm changes and to see what is and isn't working. For long-term site health, it's also good to know what Google will hit next in its never-ending fight against spammers. That's a fight Google is generally not winning, by the way, at least judging by the amount of spam in its SERPs, but it's a hard game.
Posted Jun 22, 2014 10:24 UTC (Sun)
by anselm (subscriber, #2796)
[Link] (23 responses)
Please explain how, e.g., a microformat that is intended to allow me to conveniently and reliably extract somebody's address information from their web site into my address book will be gamed by the SEO industry to produce spam.
Posted Jun 22, 2014 14:43 UTC (Sun)
by khim (subscriber, #9252)
[Link] (21 responses)
Easy. If people actually start actively using such microformats, then spammers can produce legitimate-looking pages for various companies (with correct address information presented on the pages!) which send users to a spammer-provided location only when they actually try to use the entry from their address book. Microformats can only exist in obscurity or under controlled conditions (e.g. on an intranet). Otherwise they will be exploited sooner or later.
Posted Jun 22, 2014 15:56 UTC (Sun)
by anselm (subscriber, #2796)
[Link] (19 responses)
Unlikely. I would expect a »microformat address extractor« to present the extracted address record to me before actually adding it to the address book, with an opportunity to delete or edit any field, or to reject the record outright. Obviously doctored URLs and such wouldn't make it into actual address books. (The extractor could also check blacklists like Spamhaus DBL to see whether the domains of any of the URLs in the record show up there, and emit suitable warnings or reject a record automatically.)
This is a quality-of-implementation issue on the part of software that deals with microformats. By your reasoning, e-mail should have been abolished long ago because most e-mail messages are spam. Yet strangely, for many of us e-mail is still a useful resource.
Posted Jun 22, 2014 18:46 UTC (Sun)
by bronson (subscriber, #4806)
[Link] (7 responses)
I remember people trying that a decade ago. It's too much work -- you need easy-to-use software to discover the record, present it without irritating the user, correct it, and save it somewhere useful. And it must be upgraded as the formats evolve. It sounds great in theory but the end result is always a mess that only appeals to the tiny nerd demographic.
I've got to admit, I'm surprised that not even the calendar event microformat has caught on. That seems so simple and so useful that it's almost cheating. But nope, basically zero uptake.
Posted Jun 22, 2014 22:50 UTC (Sun)
by anselm (subscriber, #2796)
[Link] (6 responses)
This is really a chicken/egg problem. Few people bother to publish data using microformats because there isn't a lot of software that uses them. On the other hand, few people feel the need to write software that uses microformats because they are so little used.
I don't think we need to appeal to the »nerd factor« to explain why we don't see more microformat-enabled web sites.
Posted Jun 23, 2014 2:02 UTC (Mon)
by bronson (subscriber, #4806)
[Link] (5 responses)
When there's demand, it's no problem. When there's very little demand, though...
Posted Jun 23, 2014 8:10 UTC (Mon)
by anselm (subscriber, #2796)
[Link] (4 responses)
Fair enough.
However, let's not forget that the initial question was how a SEO scammer would actually game microformats such as vcard information, to which khim posted a frankly ridiculous and easily refuted answer.
So suppose for a moment that microformats were indeed popular and widespread. How would a spammer arrange to, say, subvert an event microformat when (a) getting data from the page with the event on it into someone's calendar requires an explicit action (unlike spam e-mail, where the spam shows up in your inbox without an explicit action on your part), and (b) the microformat extraction tool implements reasonable and straightforward safeguards similar to those found in most web browsers or MUAs (e.g., the tool would enforce that an event listing in a microformat can only contain URLs with the same domain as the page it is on)?
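Safeguard (b) is easy to state precisely. A minimal sketch with Python's standard library (the function name and the exact policy are illustrative, not taken from any real microformat tool):

```python
# Sketch of the same-domain safeguard described above; the helper
# name and exact policy are illustrative, not from any real tool.
from urllib.parse import urlparse

def urls_allowed(page_url, extracted_urls):
    """Accept an extracted record only if every URL in it points at
    the same host as the page it came from."""
    page_host = urlparse(page_url).hostname
    return all(urlparse(u).hostname == page_host for u in extracted_urls)

print(urls_allowed("http://example.org/events",
                   ["http://example.org/tickets"]))       # → True
print(urls_allowed("http://example.org/events",
                   ["http://evil.example.net/tickets"]))  # → False
```

A production tool would presumably also normalize hosts and consult blocklists, but even this one rule defeats the "correct address, spammer URL" attack described earlier in the thread.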
Posted Jun 24, 2014 6:41 UTC (Tue)
by bronson (subscriber, #4806)
[Link] (3 responses)
Now a bit of speculatin' on the problems that a good, general purpose microformat tool might have to overcome...
First problem might be that people will click absolutely anything with an OK button. When adding calendar events that's no big deal (worst case, alarm at 3:30 in the morning: Buy \/ia9ra!) Other formats would need to be more careful, but how do you do that without harming convenience and usability?
Another problem: users would probably get desensitized to notifications. (On every page: "Add Slate to your address book!" "Add Slate to your address book!") This sounds like the browser pop-up problem: the only solution is to keep tweaking heuristics until you find an adequate compromise between advertisers and users.
Also, scope. How does one microformat tool integrate with all your other applications? Contacts (email, mail, phone, IM, etc), calendar, social graphs, outlines, CV, and all the other ones? Seems like it's got to be built into the browser?
Those seem fairly insurmountable to me. But then again, so did Linux graphics drivers, Wikipedia, and YouTube's business model. I'd be more than happy to be proven wrong.
Posted Jun 24, 2014 9:53 UTC (Tue)
by krake (guest, #55996)
[Link] (2 responses)
It could be something unobtrusive, like the RSS icon that appears when there is an RSS source available.
This way the user would initiate the data transfer.
> Seems like it's got to be built into the browser?
I guess that makes the most sense; the difficulty would be finding some kind of plugin API so that browsers could easily hook into the platform's data services.
Posted Jun 24, 2014 18:26 UTC (Tue)
by bronson (subscriber, #4806)
[Link] (1 responses)
Posted Jun 25, 2014 6:58 UTC (Wed)
by krake (guest, #55996)
[Link]
They are a trade-off between spatial needs and clarity.
I find the RSS icon useful, I've added many feeds to my reader that way.
But such an icon is just one option; if there are better unobtrusive ways to make the user aware of the additional content, then they are obviously also valid, and probably preferable, choices.
This was just an example showing that enabling users to access additional content does not require interrupting their present task with a popup.
Posted Jun 22, 2014 19:29 UTC (Sun)
by khim (subscriber, #9252)
[Link] (10 responses)
They wouldn't make it into your address book, sure. But people like you and me are an insignificant minority. Even if only half of users add the spoofed address, it will still be a win for the spammers. And what will it change? People will simply stop looking there! E-mail is kept alive by its network effect; a new service with such a huge percentage of spam would have no chance. We've seen it happen with Jabber: Google tried to “make it work” for years but eventually gave up, since most users who tried to talk to GTalk users with something other than the official client were spammers.
Posted Jun 23, 2014 9:32 UTC (Mon)
by zlynx (guest, #2285)
[Link] (1 responses)
Really? I suppose that this could be true.
I rather think that the majority of users were using multi-protocol chat clients like Pidgin, Empathy, Adium, etc.
Out of the people I know personally, the only official GTalk client we use is the one included on Android phones. *Everyone* else uses a unified IM client.
How else would we talk to people on Yahoo, AOL, MSN and Google without running multiple chat clients? Which is just silly.
Posted Jun 23, 2014 14:05 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jun 23, 2014 16:07 UTC (Mon)
by krake (guest, #55996)
[Link] (7 responses)
What?
I've been using Jabber for years, including two gmail accounts, and I have not received a single spam message. Ever.
With email I receive several on each account every day.
Since I have never used GTalk I can only assume that this is a flaw of that particular client software.
Posted Jun 23, 2014 18:21 UTC (Mon)
by raven667 (subscriber, #5198)
[Link] (6 responses)
I don't think federation ever became super popular. Because users are used to using multiple clients (or multi-protocol clients) and having accounts on each IM system they use, the problem has been worked around sufficiently that it is in no one's particular interest to consolidate further, even if that would be better for the system as a whole: something like a Nash equilibrium.
Posted Jun 23, 2014 18:53 UTC (Mon)
by krake (guest, #55996)
[Link] (2 responses)
I was puzzled by the claim that there was rampant spam on Jabber.
So I was wondering if that was somehow a flaw in the GTalk client the other poster referred to.
Sure, I could have been just very lucky but given the amount of spam I've received in the same time frame via email, I don't consider that very likely.
Posted Jun 23, 2014 23:01 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Jun 24, 2014 7:10 UTC (Tue)
by krake (guest, #55996)
[Link]
Good to know though, thanks!
Posted Jun 23, 2014 23:09 UTC (Mon)
by dlang (guest, #313)
[Link] (2 responses)
Posted Jun 24, 2014 2:35 UTC (Tue)
by raven667 (subscriber, #5198)
[Link] (1 responses)
https://www.eff.org/deeplinks/2013/05/google-abandons-ope...
Posted Jun 24, 2014 5:57 UTC (Tue)
by dlang (guest, #313)
[Link]
Posted Jun 23, 2014 21:27 UTC (Mon)
by ballombe (subscriber, #9523)
[Link]
The fact that these records follow a well-known specification means that I can process them using generic RDFa tools instead of debian.org-specific tools.
Posted Jun 23, 2014 15:53 UTC (Mon)
by b7j0c (guest, #27559)
[Link]
Posted Jun 23, 2014 15:50 UTC (Mon)
by b7j0c (guest, #27559)
[Link] (5 responses)
doesn't matter, you're not on your own private internet. your content is indexed by the same major search engines as everyone else's, and they care very deeply about mitigating blackhat SEO... so by extension, you must play by these rules also
furthermore, the search engine has a better "global" ontological view than you do. meaning, google etc can better categorize your content in the context of the rest of the network, because it *sees* the whole network and you don't. user-defined metadata tends to overstate the importance of what is on the page and tends to elevate it ontologically for no good reason.
sorry, the days of users describing their own data passed a long time ago
Posted Jun 23, 2014 16:15 UTC (Mon)
by krake (guest, #55996)
[Link]
Sure, but I think what tialaramex is saying is that not every piece of content on a given path is solely there for the purpose of a search engine.
A search engine crawler is not the only software reading a document. E.g. browsers tend to read web content, quite a few mobile apps do, etc.
Posted Jun 23, 2014 16:37 UTC (Mon)
by tialaramex (subscriber, #21167)
[Link] (3 responses)
Did you ever stop to wonder what Google search is actually doing? You talk about them trying to defeat "blackhat SEO". They don't care what colour hat anybody is wearing. They want to direct users to what they were looking for. No matter how white your hat is, if the other guy has what the user actually wants, Google's goal is to direct them to that site, not yours.
That's why SEO is so sad, ultimately - the best optimisation "trick" is to provide the content people were actually looking for, that's why for many words and phrases the top hit is a Wikipedia page about that thing.
Rather than obsessing over the use of metadata for self-description to drive search traffic, like an SEO, try to think about all the other machines using the web. Most robots are not indexing for a search engine, and most non-human agents reading a web page aren't robots (they are instead various plug-ins and proxies "eavesdropping" on pages read by humans). All of these systems can benefit from metadata, even though it doesn't help SEO. Because, SEO _doesn't matter_, it's the seagulls fighting over the scraps thrown from the ship. Seems important to the seagulls no doubt, but the ship doesn't exist to feed seagulls.
Posted Jun 23, 2014 19:38 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Cialis, viagra, mortage relief, cheap replica rolax.
Posted Jun 24, 2014 10:06 UTC (Tue)
by nye (subscriber, #51576)
[Link]
I guess it keeps you regular?
Posted Jun 24, 2014 16:31 UTC (Tue)
by b7j0c (guest, #27559)
[Link]
Posted Jun 27, 2014 3:05 UTC (Fri)
by ras (subscriber, #33059)
[Link]
From a brief read, microformats appear to be yet another attempt at serialising data in a human-readable form that is still easily parsed by a machine. Granted, it's a bit unusual in that it doesn't define a format, but rather re-uses existing formats like HTML, JSON, and XML. The core of it appears to be defining naming conventions and how they should be applied to those formats.
If I were to evaluate it, it would be on whether the naming convention is worth using as opposed to just rolling my own. That could depend on all sorts of things, I guess: the availability of generating and parsing libraries, how many people are likely to want to parse the data I published, how frequently it changes, and who would be using it. The one thing it is unlikely to depend on is what a search engine thought of the resulting HTML. And the one thing it absolutely would not depend on is whether it might be adopted by SEO practitioners.
ORLY?
It's a bit of both. I know a couple of guys who do fine-tuning of Google's ranking algorithms, and they use popular sites like Wikipedia as a yardstick: if some change kicks really popular sites like Wikipedia or CNN out of the list of results, then they double-check everything (and sometimes tweak the algorithm to exclude some “signals”). This obviously introduces bias, but it doesn't exactly put Wikipedia into the results as a hardcoded entry.
no one uses them, if they did, search engines would ignore them
E.g. a "Contact" icon appearing when there is embedded contact information.
The answer is "we don't", of course.
How on earth would that possibly work?
In my 6 years of using Jabber (probably longer, the current client's logs go back to 2008) I've never received a single spam message on any of the accounts on any of the servers (including gmail).
Trust. For example, I trust *.debian.org hosts, and they do publish RDFa records, which I use.
