I’m Training AI Chat Bots (Non-Consensually)

The Washington Post has published an article looking at the websites used to train “Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA.” If you scroll down far enough, there’s a section titled “Is your website training AI?” that lets you drop in a URL to see if it was scraped and included in the data set.

I checked three strings: “michaelhans” (to cover both this site and its prior address at michaelhanscom.com), “djwudi” (for my DJ’ing blog), and “norwescon” (a site where I’ve written, tweaked, or edited much of the content). All three are represented.

  • norwescon.org: 45k tokens, 0.00003% of all tokens, rank 528,147
  • michaelhanscom.com: 37k tokens, 0.00002% of all tokens, rank 635,948
  • djwudi.com: 3.7k tokens, 0.000002% of all tokens, rank 4,002,025

For the record, I’m not terribly excited about this. I’m also under no illusion that anything can be done; this stuff is all out on the open web, and as it’s free for actual people to browse through and read, it’s also free for bots to scrape and ingest into whatever databases they keep. Sometimes this is a good thing, for projects like the Internet Archive. Sometimes it’s unwittingly helping to train our new AI overlords.

Managing Inbox Overload with Google Buzz

My Google account just got set up with Google Buzz, the new social networking addition to Google’s stable. One of the first things I noticed was that this could be a recipe for inbox overload, as every new reply to something I’ve posted or replied to ends up as a new message in my inbox.

Inbox Overload

Simple solution: set up a filter. Here are the settings I used…

  1. Click the “Create a filter” option just to the right of the search box and related buttons towards the top of the screen.

    Create a Filter

  2. Enter “Buzz” in the “Subject” field of the filter options box, then run a test search. Unfortunately, this will catch any message that uses the word “buzz” in the subject line, and from my testing, neither adding a colon (“Buzz:”) nor surrounding the word with quotation marks makes a difference. I can’t currently find a way to force the filter to grab only messages that begin with the word “Buzz”, so caveat emptor. If your test search looks acceptable, click “Next Step”.

    Update: As a few people have pointed out to me (and as has been posted elsewhere), there is a better approach: enter “label:buzz” in the “Has the words:” field of the options box instead of filtering on the subject. Google will pop up a warning, but go ahead and ignore it.

    Filter Options Screen One

  3. In the next screen, activate “Skip the Inbox (Archive it)” and “Apply the label:”, then create a new label titled “Buzzes” (or whatever you want, but you can’t use “Buzz”). If you want, click the checkbox to apply the filter retroactively to the messages caught by the filter’s test run. Then click “Create Filter”.

    Filter Options Screen Two

  4. You’re done!

From now on, rather than getting flooded with inbox messages every time a new Buzz response pops up, you’ll have a little ‘Buzzes’ label sitting on the left side of your screen. If it’s bold, you’ve got a response waiting for you. And that’s it!
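
To see why the subject-based filter is so grabby while the label-based one isn’t, here’s a rough Python sketch. This is only an illustration: the sample messages, the function names, and the word-matching rule are all assumptions, and none of this is how Gmail actually implements its filters.

```python
import re

# Hypothetical messages; only the first is a real Buzz notification.
messages = [
    {"subject": "Buzz: a reply to your post", "labels": {"buzz", "inbox"}},
    {"subject": "What's the buzz about the new phone?", "labels": {"inbox"}},
]

def subject_filter(message):
    """Match any subject containing the word "buzz" (assumed matching rule)."""
    return re.search(r"\bbuzz\b", message["subject"], re.IGNORECASE) is not None

def label_filter(message):
    """Match only messages that carry the buzz label."""
    return "buzz" in message["labels"]

caught_by_subject = [m["subject"] for m in messages if subject_filter(m)]
caught_by_label = [m["subject"] for m in messages if label_filter(m)]
```

In this sketch the subject filter catches both messages, while the label filter catches only the genuine Buzz notification.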

Links for January 29th through January 30th

Sometime between January 29th and January 30th, I thought this stuff was interesting. You might think so too!

  • Do Humanlike Machines Deserve Human Rights?: "This question is starting to get debated by robot designers and toymakers. With advanced robotics becoming cheaper and more commonplace, the challenge isn't how we learn to accept robots–but whether we should care when they're mistreated. And if we start caring about robot ethics, might we then go one insane step further and grant them rights?"
  • On the Flickr support in iPhoto ‘09: From Fraser Speirs, author of the excellent Flickr Export plugin for iPhoto and Aperture: "I acquired my copy of iLife ‘09 yesterday and decided to dive deep on how Apple have implemented Flickr integration in iPhoto ‘09. Here are the results of my investigation. Be aware as you read that this is the result of a morning’s click-around investigation and not months of serious use. I will do my best to give an honest assessment of what is in iPhoto ‘09, and you’ve already read my full disclosure in the previous paragraph."
  • Google School: Find Images by Exact Dimensions, Make Wallpaper Search a Breeze: "Weblog Design Live uncovers the undocumented search operator (that's also new to us) and demonstrates how to use it. Just use the imagesize operator followed by the WidthXHeight in pixels." For instance, imagesize:320x480 goth finds iPhone/iPod Touch wallpaper ready 'goth' images (for a potentially odd interpretation of 'goth', that is).
  • White House Unbuttons Formal Dress Code: "The capital flew into a bit of a tizzy when, on his first full day in the White House, President Obama was photographed in the Oval Office without his suit jacket. There was, however, a logical explanation: Mr. Obama, who hates the cold, had cranked up the thermostat. 'He's from Hawaii, O.K.?' said Mr. Obama's senior adviser, David Axelrod, who occupies the small but strategically located office next door to his boss. 'He likes it warm. You could grow orchids in there.'"
  • Create Your Own Original Star Trek Story: The original Star Trek only managed to make 80 episodes before running out of Dilithium. Not enough! So we mixed up the show's most frequent plot twists, to create a foolproof Trek story generator.

rel=“nofollow” : Massive weblog anti-spam initiative

Wow. Straight from Jay Allen:

Six Apart has announced in co-operation with Google, Yahoo, MSN Search and other blog vendors a massive joint anti-spam initiative based on the HTML link type rel="nofollow".

The initiative is based upon the idea of taking away the value of user-submitted links in determining search rankings. By placing rel="nofollow" into the hyperlink tags of user-submitted feedback, search engines will ignore those links for the purposes of ranking (e.g. PageRank) and will not follow them when spidering a site.

[…]

It is important to note that while the links will no longer count for PageRank (and other search engines’ algorithms), the content of user-submitted data will still be indexed along with the rest of the contents of the page. Forget all of those silly ideas of hiding your comments from the GoogleBot. Heck, the comments in most blogs are more interesting than the posts themselves. Why would you want to do that to the web?

Now, the astute will point out that because links in comments/TrackBacks are ignored by the search bots, the PageRank of bloggers all around the blooog-o-sphere will suffer because hundreds of thousands of comments linking back to their own sites will no longer count in the rankings. And that is most likely true. But that inflated PageRank, which was a problem created by the search engines themselves, is the rotting flesh that the maggots sought out in the first place. If you ask me, I say fair trade.

In the end, of course, this isn’t the end of weblog spam. But because it completely takes away the incentive for the type of spamming we’re seeing today in the weblog world, you will probably see steady decline as many spammers find greener pastures elsewhere. That decline combined with better tools should help to make this a non-issue in the future. Every little step counts, some count more than others, and history will be the judge of all.

Very cool. Also very similar to a technique I was using a couple years back, though that was geared to blocking off areas of the site to ignore rather than affecting individual links. Either way, though, it’s a big step forward. I’m especially heartened to see the list of competing companies and weblogging systems that are participating in this.
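
As a rough idea of what a weblog tool might do to user-submitted comments before publishing them, here’s a minimal Python sketch. The add_nofollow function and its regular expressions are assumptions made for illustration, not code from Six Apart or any of the participating vendors, and a production system would more likely use a real HTML parser than regular expressions.

```python
import re

# Matches an opening <a ...> tag and captures its attributes.
A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Matches an existing rel attribute (quoted or unquoted) so it can be replaced.
REL_ATTR = re.compile(r"""\s+rel\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)

def add_nofollow(comment_html):
    """Return comment_html with rel="nofollow" forced onto every link."""
    def rewrite(match):
        attrs = REL_ATTR.sub("", match.group(1))  # drop any submitted rel value
        return f'<a{attrs} rel="nofollow">'
    return A_TAG.sub(rewrite, comment_html)

comment = 'Nice post! <a href="http://example.com/">my site</a>'
safe_comment = add_nofollow(comment)
```

Any rel value the commenter supplied is discarded rather than kept, so a spammer can’t dodge the rewrite by submitting a rel attribute of their own.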

Google to me in eight clicks

Meme time, started by A Whole Lotta Nothing, and being tracked by Kottke: how many clicks to get from Google’s homepage to your website without using the search box?

For me, it’s eight.

  1. Google »
  2. More »
  3. Blogger »
  4. Knowledge »
  5. Working With Blogger »
  6. How Not to Get Fired Because of Your Blog »
  7. Seattle Times: Microsoft Fires Worker Over Weblog »
  8. eclecticism

Getting in Google's good graces

One of the constant topics that many webmasters and webloggers are concerned with these days is Google: how to increase your site’s standing in Google’s eyes, and thereby drive more traffic to your site. I use a number of techniques on my weblog, both in the code and how I create entries, that help Google get the most useful information out of my pages.

While I’ve mentioned some in the past, the subject recently came up in a thread on the TypePad User Group, and I shared some of my methods in that thread. At the request of both Liza and Richard, who have also been posting about this topic, I’m re-posting my post (post-haste, though not post-mortem, and definitely not postpartum) here…

Still, I’m amazed to read that you had 1,000 per day BEFORE MS made you a web celeb (boo! to them). Do you think those hits came from your blogging subject or from special tactics you engaged in to increase your site traffic?

A little bit of both, probably.

First off, it’s not so much my subject, as my lack of subject. ;) Because I’ve never really focused on any specific topic for my blog, and just randomly babble about whatever crosses my mind, that gives Google a lot of potential keywords to pick up on.

Also, I’ve been at this for about three years now, so I’ve got a fairly large archive section, which also increases the probability of any given keyword turning up in a search.

As far as special tactics, there are a few techniques I’ve picked up on over the years that seem to help (some of which you covered in your post).

  1. Descriptive headlines as a page title. The title of a webpage scores very highly in Google’s ranking scheme, so I generally try to make sure that my post titles are descriptive of what I’m posting about (“Lord of the Rings Trailer” rather than “This is cool!”), and I make sure that the post title is included in the page title.

    I believe that TypePad is set to include post titles in page titles for individual archives by default, but some weblog tools (including Movable Type in its early stages, I believe, though I could be wrong) only include the site name for every page title, so instead of a site containing 1000+ differently named pages, you’d end up with a site containing 1000+ pages all named “My Weblog”, which doesn’t give Google nearly as much to work with.

  2. Setting a consistent structure for the code on each page. As HTML was designed to emulate (though not visually replicate) the structure of a printed document, it includes structural elements such as multiple levels of headings. As Google pays attention to these when it scans a document, it often helps to use them correctly.

    In the past, rather than using the <h1>, <h2>, etc. elements for headlines, division markers, and so on, many sites would use <font> tags to give their subdivision headings the look they wanted. Now that the <font> tag has been deprecated and we can use CSS to style every element on a page the way we want, it’s good to return to using structurally correct markup. In addition to making a site much easier to code, it also assists Google in determining the structure, topic, and relevance of any given page.

    For each individual archive page on my site, I’ve structured it as follows:

    1. <title>: website name > post title

    2. <h1>: website name

    3. <h2>: website ‘tagline’

    4. <h3>: post title

    5. <p>: post body

    6. <h3>: trackback

    7. <h4>: trackback source

    8. <p>: trackback body

    9. <h3>: comments

    10. <h4>: comment author

    11. <p>: comment body

    12. <h3>: comment posting form

    This gives each page a clearly delineated, easy to read structure that tells both the reader and Google which parts of the page are the most important and the most relevant to the topic of the page.

  3. Link descriptively. Simply, this involves using natural language for your links so that the link is descriptive to what it points to. For instance, saying “The new Lord of the Rings trailer is out!” instead of “You’ve gotta see this!” gives Google more information about what you’re linking to.

    This carries a double benefit: not only does it give Google better information about the topic of your own page, it also tells Google more about the page you’re linking to, which helps out whoever is on the target end of your link.

  4. Alt text on all images. This is important for a few reasons. First off, it lets Google know what each image is so that Google can include it more reliably in their image search feature. Secondly, though, and more importantly, it greatly improves the readability of your site for people with disabilities using specialized browsers to read the web.

    Blind users can use a “screen reader” to read websites — this is a specialized browser which translates the text to audio, and reads the page to them. Without alt text, all that screen reader can do is give them the name of the graphic, and might end up telling them something like “Image named funnypicture.jpg”. With alt text, they’ll instead hear something like “Image named Gimli falls off his horse”.

  5. Use the excerpt field to create useable descriptions. While keywords are no longer recognized by Google, another <meta> tag in the <head> section of your document still is (I think), which helps Google determine the topic of the page, and that’s the ‘description’ tag. What I’ve done is put this code into the <head> of each individual archive:

    <meta name="description" content="<$MTEntryExcerpt$>" />

    I then make sure to take a moment to create an excerpt for each entry as I’m making it that relates to the topic of the post, rather than just relying on TypePad’s auto-generated excerpt (which generally just grabs the first n words of each post).

Anyway, those are a few of the things I do which seem to help my site visibility. Mostly, though, I think a lot of it just boils down to the fact that after three years of babbling, I give Google a lot to work with. ;)
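
Several of the points above (page title, heading structure, alt text, and the description tag) can be spot-checked with a short script. Here’s a minimal sketch using Python’s standard html.parser module. The PageAudit class and the sample page are hypothetical, made up for illustration; the script only reports what is on the page and says nothing about how Google actually weighs any of it.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")

class PageAudit(HTMLParser):
    """Collect the page title, heading outline, description, and images lacking alt text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []            # list of (level, text) in document order
        self.images_missing_alt = []  # src values of images with no alt text
        self.description = None
        self._capture = None          # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" or tag in HEADINGS:
            self._capture, self._buffer = tag, []
        elif tag == "img" and not (attrs.get("alt") or "").strip():
            self.images_missing_alt.append(attrs.get("src") or "(no src)")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            else:
                self.headings.append((int(tag[1]), text))
            self._capture = None

# A made-up individual archive page following the structure described above.
SAMPLE_PAGE = """
<html><head>
<title>My Weblog: Lord of the Rings Trailer</title>
<meta name="description" content="The new trailer is out." />
</head><body>
<h1>My Weblog</h1>
<h2>A tagline</h2>
<h3>Lord of the Rings Trailer</h3>
<p>Post body. <img src="gimli.jpg" alt="Gimli falls off his horse" />
<img src="funnypicture.jpg" /></p>
<h3>Comments</h3>
<h4>A commenter</h4>
<p>Comment body</p>
</body></html>
"""

audit = PageAudit()
audit.feed(SAMPLE_PAGE)
audit.close()
```

After running this, audit.headings holds the outline as (level, text) pairs and audit.images_missing_alt lists any images a screen reader would have to describe by filename.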

Help search engines index your site

We all know that Google is god. Chances are you’ve used Google when doing a search on the ‘net at least once, if not daily, or many times a day. If not, then I’ve heard rumors that there are other search engines out there — though I haven’t used any in so long, I can’t really vouch for the veracity of that rumor. ;)

I wanted to share a few tricks I use here to help Google (and other search engines) index my site, and to try to ensure that searches that hit my site get the most useful results.

All of the following tips and tricks do require access to your source HTML templates (in TypePad, you’ll need to be using an Advanced Template Set). While I’m writing this for an Advanced TypePad installation, the tips will work just as well in any other website or weblog application where you have access to the HTML code.

Specify which pages get indexed, and which don’t

What? One of the most important pages on a weblog from a user’s point of view is the main page. It has all your latest posts, all the links to your archives, your bio, other sites you enjoy reading, webrings, and who all knows what else. However, from the perspective of a search engine, the main page of a weblog is most likely the single least important page of the entire site!

This is simply because the main page of a weblog is always changing, but search engines can only give good results when the information that they index is still there the next time around. I’ve run into quite a few situations where I’ve done a search for one term or another, and one of the search results leads to someone’s weblog. Unfortunately, when I go to their page, the entry that Google read and indexed is no longer on the main page. At that point, I could start digging through their archives and trying to track down what I’m looking for — but I’m far more likely to just bounce back to Google and try another page.

Thankfully enough, though, there’s an extremely easy fix for this that keeps everyone happy.

How? One short line of code at the top of some of your templates is all it takes to solve the problem. We’re going to be using the robots meta tag in the head of the HTML document. The tag was designed specifically to give robots (or spiders, or crawlers — the automated programs that search engines use to read websites) instructions on what pages should or shouldn’t be indexed.

For the purposes of a weblog, with one constantly changing index page and many static archive pages, the best possible situation would be to tell the search engine to read and follow all the links on an index page (so that it finds all the other pages of a site), but not to index that page. The rest of the site, it will be free to read and index normally.

That’s very easy to set up, as it turns out. The robots meta tag allows four possible arguments:

  • INDEX: Read and index a page normally
  • NOINDEX: Do not index any of the text of the page
  • FOLLOW: Follow all the links on a page to read linked pages
  • NOFOLLOW: Ignore all links on a page

So, in order to do what we want, we add the following meta tag to our document, in the head section, right next to the meta tags that are already there:

<meta name="robots" content="noindex,follow" />

Now, when a search engine robot visits the index page of the site, it knows that it should not index the page and add it to its database, but that it should follow any links on that page to find other pages within the site. This way, searches that return hits for the site will be sure to find your archive pages for the information that is requested, rather than your front page, which may not have the information anymore.
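
For the curious, here’s a minimal Python sketch of how a well-behaved crawler might interpret the content of a robots meta tag. The parse_robots function is a hypothetical name for illustration, and it handles only the four arguments listed above; real crawlers may recognize additional directives and are free to behave differently.

```python
def parse_robots(content):
    """Turn a robots meta 'content' value into index/follow decisions.

    With no directive given, a page is indexed and its links followed.
    """
    policy = {"index": True, "follow": True}
    for token in content.lower().split(","):
        token = token.strip()
        if token == "noindex":
            policy["index"] = False
        elif token == "index":
            policy["index"] = True
        elif token == "nofollow":
            policy["follow"] = False
        elif token == "follow":
            policy["follow"] = True
    return policy

# The tag suggested for a weblog's main index page.
front_page_policy = parse_robots("noindex,follow")
```

Here front_page_policy says not to index the page but to follow its links, which is exactly the behaviour wanted for a constantly changing front page.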

Update: It turns out that this technique may have some side effects that I hadn’t considered, and might possibly not work at all. For more details, please scroll down to Anode’s comment and my reply in the comment thread for this post. Hopefully I’ll be able to dig up more information on this soon.

Fine tune what sections of a page get indexed

What? There is a proposed extension to the robots meta tag that allows you to not just designate which pages of a site get indexed, but also which sections of a page get indexed. I discovered this when I was setting up a shareware search engine for my old website, and have since gotten in the habit of using it. Now, this is not a formal standard, and I don’t know for sure which search engines support it and which don’t — the creator of this technique has suggested it to the major search sites, but it is not known what the final result was.

Now, why would you want to do this? Simply this: on many weblogs, including TypePad sites, the sidebar information is repeated on every page of the site. There is also certain informational text repeated on every page (for instance, the TrackBack data, the comments form, and so on). This creates a lot of extraneous, mostly useless data — doubly so when that information changes regularly.

By using these proposed tags, any search engine that supports them will only index the sections of a page that we want indexed, and will disregard the rest of the page.

How? Because this is based on the robots meta tag discussed above, it uses the same four arguments (INDEX, NOINDEX, FOLLOW, and NOFOLLOW). Instead of using a meta tag, though, we use HTML comment syntax to designate the different sections of our document.

For instance, every individual archive page on a TypePad weblog that has TrackBack enabled will have the following text (or something very similar):

Trackback
TrackBack URL for this entry:
http://www.typepad.com/t/trackback/(number)

Listed below are links to weblogs that reference (the name of the post)

In order to mark this out as a section that we wanted the search engine not to index and not to follow (as the only link is to the page that the link is on), we would surround it with the following specialized tags:

<!-- robots content="noindex,nofollow" -->
<!-- /robots -->

For example, I would change the code in the TypePad Individual Entry template to look like this:

<mtentryIfAllowPings>
<!-- robots content="noindex,nofollow" -->
<h2><a id="trackback"></a>TrackBack</h2>
TrackBack URL for this entry:<br /><$MTEntryTrackbackLink$>
Listed below are links to weblogs that reference <a href="<$MTEntryPermalink$>"><$MTEntryTitle$></a>:
<!-- /robots -->
<mtpings>

The same technique can be used wherever you have areas in your site with content that doesn’t really need to be indexed.
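
To make the idea concrete, here’s a short Python sketch of what a search engine that honours these comment markers might do before indexing a page. Since it isn’t known which engines (if any) support the proposal, this is purely hypothetical: the indexable_html function and its regular expression are assumptions, and the sketch handles only the noindex half of the directive, not nofollow.

```python
import re

# Matches <!-- robots content="..." --> ... <!-- /robots --> and captures
# both the directive list and the enclosed markup.
ROBOTS_SECTION = re.compile(
    r'<!--\s*robots\s+content="([^"]*)"\s*-->(.*?)<!--\s*/robots\s*-->',
    re.IGNORECASE | re.DOTALL,
)

def indexable_html(page_html):
    """Drop every section marked noindex; keep other marked sections' contents."""
    def keep_or_drop(match):
        directives = [d.strip().lower() for d in match.group(1).split(",")]
        return "" if "noindex" in directives else match.group(2)
    return ROBOTS_SECTION.sub(keep_or_drop, page_html)

page = (
    "<p>Post body.</p>"
    '<!-- robots content="noindex,nofollow" -->'
    "<h2>TrackBack</h2>TrackBack URL for this entry:"
    "<!-- /robots -->"
    "<p>Comments.</p>"
)
cleaned = indexable_html(page)
```

In this example the TrackBack boilerplate disappears from cleaned while the post body and comments remain.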

Now, as I stated above, this is only a proposed specification, and it is not known which (if any) search engines support it. It also requires a healthy chunk of mucking around with your template code. Because of these two factors, it may not be an approach that you want to take, instead simply using the “sledgehammer” approach of the page-level robots meta tag discussed above.

However, I do think that the possible benefits of this being used more widely would be worth the extra time and trouble (at least, for those of us obsessive about our code), and I’d also suggest that, should TypePad gain a search functionality, these codes be recognized and followed by the (purely theoretical, at this point) TypePad search engine.

Put the entry excerpt to use

What? The entry excerpt is another very handy field to use in fine tuning your site. I believe that the field is turned off on the post editing screen by default, but it can be enabled by clicking on the ‘Customize the display of this page’ link at the bottom of the post editing screen.

By default, the entry excerpt is used for two things in TypePad: when you send a TrackBack ping to another weblog, the excerpt is sent along with the ping as a short summary of your post; and it is used as the post summary in your RSS feed if you have selected the ‘excerpts only’ version of the feed in your weblog configuration. However, it can come in handy in a few other instances too. One that I’ve discussed previously is in your archive pages. The excerpt can also be used to help out search engines.

You may have noticed that when you do a search on Google, rather than simply returning the link and page title, Google also returns a short snippet of each page that the search finds. Normally, this text snippet is just a bit of text from the page being referenced, intended to give some amount of context to give you a better idea of how successful your search was. There is a meta tag that lets us determine exactly what text is displayed by Google for the summary, though, which is where the entry excerpt field comes in.

How? We’re adding another meta tag here, so this will go up in the head section of your Individual Archives template. Next to any other meta tags you have, add the following line:

<meta name="description" content="<$MTEntryExcerpt$>" />

Then save, and republish your Individual Archives, and you’re done. Now, the next time that Google indexes your site, the excerpt will be saved as the summary for that page, and will display beneath the link when one of your pages comes up in a Google search.

So what happens if you don’t use the entry excerpt field? Well, TypePad is smart enough to do its best to cover for this — if you use the <$MTEntryExcerpt$> tag in a template, and no excerpt has been added to the post, TypePad automatically pulls the first 20 words of your post to be the excerpt. While this works to a certain extent, it doesn’t create a very useful excerpt (unless you’re in the habit of writing extremely short posts). It’s far better to take a moment to create an excerpt by hand, whether it’s a quick cut and paste of relevant text in the post, or whether it’s more detailed (“In which we find out that…yadda yadda yadda.”). In the end, of course, it’s your call!
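
The fallback behaviour described above is easy to picture in code. Here’s a minimal Python sketch, with the caveat that entry_excerpt and description_tag are hypothetical helpers written for illustration, not TypePad’s actual implementation; the only detail taken from the description above is the 20-word cutoff.

```python
from html import escape

def entry_excerpt(body, manual_excerpt="", word_limit=20):
    """Use the hand-written excerpt if there is one, else the first words of the post."""
    if manual_excerpt.strip():
        return manual_excerpt.strip()
    return " ".join(body.split()[:word_limit])

def description_tag(text):
    """Build the description meta tag, escaping characters that would break the attribute."""
    return f'<meta name="description" content="{escape(text, quote=True)}" />'

post_body = "We all know that Google is god. " * 10  # a long post with no excerpt
auto_tag = description_tag(entry_excerpt(post_body))
manual_tag = description_tag(entry_excerpt(post_body, 'Four tips for "helping" Google'))
```

The escaping step matters because a stray quotation mark in an excerpt would otherwise end the content attribute early; whether TypePad escapes the excerpt for you is not something this sketch assumes either way.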

Use the Keywords

What? Keywords are short, simple terms that are either used in a page, or relate to the page. The original intent was to place a line in the head of an HTML page that listed keywords for that page, which search engines could read in addition to the page content to help in indexing.

Unfortunately, keywords have been heavily abused over the years. ‘Search Engine Optimizers’ started putting everything including the kitchen sink into their HTML pages for keywords in an effort to drive their pages’ rankings higher in the search engines. Because of this, some of the major search engines (Google included) now disregard the ‘keywords’ meta tag. Not all of them do, however, and used correctly, keywords can be a helpful additional resource for categorizing and indexing pages.

How? One of the various fields you can use for data in each TypePad post is the ‘Keywords’ field. I believe that it is turned off by default; however, you can enable it by clicking on the ‘Customize the display of this page’ link at the bottom of your TypePad ‘Post an Entry’ screen.

Once you have the ‘Keywords’ field available, you can add specific keywords for each post. You can either use words that actually appear in the post, or words that relate closely to it. For instance, I’ve had posts where I’ve used the acronym WMD in the body of the post, then added the three keywords ‘weapons mass destruction’ to the keywords field. You never know exactly what terms someone will use in their search, so you might as well give them the best shot at success, right?

Okay, so now you have keywords in your posts. What now? By default, TypePad’s templates don’t actually use the data in the Keywords field at all. This is fairly easy to fix, however.

In your Individual Archives template, add the following line of code just after the meta tags that are already there:

<meta name="keywords" content="<$MTEntryKeywords$>" />

Then save your template, republish your site (you can republish everything, but doing just the Individual Archives is fine, too, as that’s all that changed), and you’re done! Now, the next time that a search engine that reads the keywords meta tag reads your site, you’ve got that much more information on every individual post to help index your site correctly.
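
If you generate pages with your own scripts rather than TypePad’s templates, the same tag is simple to build by hand. Here’s a small Python sketch; keywords_tag is a hypothetical helper for illustration, and joining the terms with commas is just the common convention for this tag, not something TypePad is known to do.

```python
from html import escape

def keywords_tag(keywords_field):
    """Build a keywords meta tag from a space-separated keywords field."""
    unique = []
    for word in keywords_field.lower().split():
        if word not in unique:  # drop repeats, keep the original order
            unique.append(word)
    content = escape(", ".join(unique), quote=True)
    return f'<meta name="keywords" content="{content}" />'

# The WMD example from above, with an accidental repeat thrown in.
tag = keywords_tag("weapons mass destruction weapons")
```

The repeated word is dropped, which keeps the tag honest rather than stuffed.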

Conclusion

So there we have it. One extremely long post from me, with four hopefully handy tips for you on how you can help Google, and the rest of the search engines out there, index your site more intelligently. If you find this information of use, wonderful! If not…well, I hope you didn’t waste too much of your day reading it. ;)

Feel free to leave any questions, comments, or words of wisdom in the comments below!