March 31st, 2005 05:23 am: Simple Attention Modelling
DVDs, TV and movies in theaters have periods where unwanted stuff is presented to you along with the core entertainment. Foo. This is expensive. For example, some folk bill their time at $50/hour, hence the 12-minute fluff (about upcoming movies, no popcorn, the FBI spam about copyright infringement) costs $10, more than the ticket price. Worse, there is no comparable choice to allow market forces to play. DVDs also have 'no skip' areas, presumably at the request of the DVD cartel.
Of course, one does not have to comply with all the details of a provider's business model, hence such well-established practices as tape-fast-forward-thru-commercials and PVR ad-skipping.
It looks like the common feature is timed periods of fluff, varying by technology (TV, movie, DVD), channel (Turner Classics; local movie theatres) and performance of the entertainment. The common first-order response is selective inattention (TV, DVDs), fast-forward (PVR, tape) or physical absence (coming late to movies). The providers' general response to this is variation (unknown start-of-real-movie, variation of ad breaks in TV, context dependence for fluff presentation in DVDs).
Bulk modelling of this kind of stuff is often done with a few tables, which is blessedly brief.
create table technology (tname varchar2(10));
create table channel (cname varchar2(40));
create table entertainment (ename varchar2(40));
create table perf (pid integer, tname, cname, start_time date);
create table fluff (ffid integer, offset integer, duration integer);
create table fluff_intervals (fid integer, tname, cname, ename, pid, ffid);
This should actually work in type-lax databases like sqlite. There are other things one might choose to model, but this is enough to at least query fluff and support the common consumer responses.
Assuming these are filled with adequate actual data, one can do things like: tell me fluff intervals for the current entertainment:
select f.offset, f.duration, f.ffid from fluff_intervals fi, fluff f, perf p where f.ffid=fi.ffid and fi.pid=p.pid and p.pid=1024
Given this, for example in a technologically mediated environment (like DVDs) one could call up an app which would turn off the sound and cover the DVD player with an alternate app (your word processor, or a browser, perhaps). In physical media this would, for example, support late-coming to movie theaters.
Filling these tables (best done by many hands) is left as an exercise for the reader, as are the creation of technical means and/or publication of the relevant details for those latecomers so beloved of the movie houses.
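As a rough sketch of how the fluff tables could drive a "mute it now" decision, here is the query run through Python's sqlite3 module, with the tables joined on ffid and pid. The sample rows, the assumption that offset and duration are in seconds, and the helper names fluff_for and in_fluff are all made up for illustration; only the three tables the query touches are created.

```python
import sqlite3

# In-memory database. Schema follows the post; "offset" is quoted because
# OFFSET is also an SQL keyword.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table perf (pid integer, tname, cname, start_time date);
create table fluff (ffid integer, "offset" integer, duration integer);
create table fluff_intervals (fid integer, tname, cname, ename, pid, ffid);
""")

# Hypothetical sample rows: performance 1024 has two fluff periods,
# performance 2048 has one. Offsets and durations are assumed to be seconds.
conn.execute("insert into perf values (1024, 'movie', 'local theatre', '2005-03-31 19:00')")
conn.execute("insert into perf values (2048, 'dvd', 'home', '2005-03-31 21:00')")
conn.executemany("insert into fluff values (?, ?, ?)",
                 [(1, 0, 720), (2, 3600, 120), (3, 0, 90)])
conn.executemany("insert into fluff_intervals values (?, ?, ?, ?, ?, ?)",
                 [(1, 'movie', 'local theatre', 'feature', 1024, 1),
                  (2, 'movie', 'local theatre', 'feature', 1024, 2),
                  (3, 'dvd', 'home', 'feature', 2048, 3)])

def fluff_for(pid):
    """Return (offset, duration, ffid) rows for one performance, in time order."""
    sql = """
        select f."offset", f.duration, f.ffid
          from fluff_intervals fi, fluff f, perf p
         where f.ffid = fi.ffid and fi.pid = p.pid and p.pid = ?
         order by f."offset"
    """
    return conn.execute(sql, (pid,)).fetchall()

def in_fluff(pid, seconds_in):
    """True if seconds_in (seconds after the start) falls inside a fluff period."""
    return any(off <= seconds_in < off + dur for off, dur, _ in fluff_for(pid))
```

A player-side app could poll in_fluff(pid, elapsed) and mute or cover the screen while it returns True; a latecomer only needs the end of the first interval.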
March 14th, 2005 10:30 pm: Generalities About Endings
We are complex biological beings, and hence eventually reach a clearly defined non-functional quiescent state (death). Contrast this with, say, a treasured family heirloom, hundreds of years old, slowly becoming scarred, losing bits and pieces, but otherwise mostly functional.
We can compare ourselves with other biological beings (rabbits, bacteria, viruses). This places us in the low-reproduction-rate, long-lived, charismatic megafauna. For us, time of death can occur in decade-long time scales, and has clear population statistics. (See some California 1998 stats; this can depend on social arrangements and has varied over human history.) Roughly speaking we see almost everybody dies by age 100, with rapidly increasing mortality over the years after 70 and a bump in deaths before age 1.
Going up one level in aggregate eschatology, we can consider historical human die-backs (European Black Death, American Indians). I note that the Black Death left a majority of the population untouched (Black Death was probably a single malady or a few related maladies) and that considerably more than 1% of the original population of American Indians survived after exposure to the full European catalogue of infectious disease. Despite repeated human-engineered attempts to eradicate rabbits (who live in high-density clumps) in Australia, I still see 'em. Thus the canonical self-replicating technology, disease, does not seem to be effective as a species-destroying agency (at least for widely-spread species (which we are)).
Bill Joy highlights self-replicating technologies (biological and nanotech) as the most efficient method of global destruction, and correctly emphasizes nanotech as the most precise technology, with the least likelihood of blow-back (killing the creators). Ever since the 2001 anthrax release in the US we have to consider ourselves as intelligent Australian rabbits, potentially subject to attack by engineered disease.
From an eschatological viewpoint, what matters is whether an attack leads to extinction rather than die-back. Rational opponents would not want to solve the hard problem of global die-back and don't need anything like it. Any serious percentage of deaths in a wealthy nation will lead to dramatic social and economic changes. This is probably the most precise useful outcome of such a chancy method of attack. (Note that anthrax, developed as a military weapon, has only local effects. This is not a coincidence.) Thus we expect system disruption at smaller scales long before the use of Joyian global destruction.
In the longer term a moderately funded small-group opponent may wish to follow Bill's nanotech option. It is worthwhile to spend some time, however, thinking about the system properties of the human population that have made it resistant to extinction.
February 17th, 2005 05:39 pm: Level of detail: Chomsky and the 2000 election
I like Chomsky's emphasis on the usefulness of non-experts, so I thought I would briefly explore his riff on elections and coin-tossing with my modest statistical literacy. Chomsky says "Under what conditions would we expect 100 million votes to divide 50-50, with variations that fall well within expected margins of error of 1-2 percent? There is a very simple model that would yield such expectations: people were voting at random."
I ran down a suitably official source on the 2000 elections he talked about. Trying to keep the investigation simple, I looked at the presidential race. Further, I ignored the reality of the electoral college.
Probability and statistics can be applied to most count-based situations, but you do need to play by the rules. Taking the simplest analytical framework, we take "random" as "independent choices with probability of voting Democrat 50%". In this framework, the "1-2%" error would be astonishingly large. More precisely, if 100 million voters independently voted Democrat with 50% probability, one would expect the standard deviation (sigma) of the Democrat count to be sqrt(100 million)/2 or 10,000/2 = 5,000. Given the large number of trials, this simple model yields a 3 sigma range (in a Normal distribution this covers 99.73% of the probable values) of plus or minus 15,000 votes, a window about 0.03% of the total wide. Actual winning margins were about 500 thousand, in statistical company with Elvis, alive in 2000, shooting pool in a swimming pool while Marilyn Monroe dances on the diving board.
So I am in violent agreement with Chomsky's "the simplest one is not strictly valid". Even two-party models are suspect: I note that the minor parties are significant. In fact, the Greens, with 2.5 million voters, seem at least as important as Canova's 1.4 million disenfranchised blacks. Statistical analysis can also be applied at various levels of granularity.
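The coin-toss arithmetic above is short enough to check directly. This is only a sketch of the simple binomial model described in the post; the variable names are mine, and the 500 thousand margin is the round figure quoted above, not an official count.

```python
import math

n = 100_000_000      # voters, from the post
p = 0.5              # probability of voting Democrat under the coin-toss model

# Standard deviation of a binomial count: sqrt(n * p * (1 - p)) = sqrt(n) / 2
sigma = math.sqrt(n * p * (1 - p))
three_sigma = 3 * sigma

# A winning margin of about 500 thousand puts the winner's count
# 250 thousand above n / 2.
margin = 500_000
sigmas_out = (margin / 2) / sigma

print(f"sigma = {sigma:,.0f} votes")
print(f"3 sigma = {three_sigma:,.0f} votes ({three_sigma / n:.3%} of the total on either side)")
print(f"observed margin is about {sigmas_out:.0f} sigma out")
```

Three sigma is 0.015% of the vote on either side of an even split (0.03% end to end), and a 500 thousand vote margin sits about 50 sigma out, hence Elvis.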
Ignoring for the moment the minor parties, current and former prisoners and the complaints of the Right about liberal bias, we can look at actual data at the state level, concentrating on Democrats and Republicans. We calculate the relevant 3 sigma range based on each state's total Democrat and Republican voters.
sqlite> select state from stats where abs((dem+rep)/2 - dem) < threesig;
Florida
New Mexico
So only Florida and New Mexico fit a sharpened Chomsky hypothesis. Perhaps other states are close, 'tho:
sqlite> select state,abs((dem+rep)/2 - dem)/threesig sigs from stats sigs where abs((dem+rep)/2 - dem) < 3*threesig;
Florida|0.0741657620022538
Iowa|1.22434286508249
New Mexico|0.161141254302249
Oregon|1.88314705988215
Wisconsin|1.20812975243626
So Iowa, Oregon and Wisconsin are maybe sorta statistically interesting. But every other state fails the sharpened Chomsky model. Decisively. Elvis and Monroe territory.
Overall, even a statistical expert (possibly using the Chi-square goodness-of-fit test, even fitting the nationwide Democrat vote to reality) would have to conclude that the sharpened Chomsky model fails badly as a brief description of the entire country. Indeed, a third year physics or social science university student should be able to do this up right as part of their school work.
The numbers do show that Florida and New Mexico were really close. If you believe that elections should not be decided because of some folk flipping a coin, then Florida and New Mexico will bother you deeply, since even a small percentage of folk doing this could have changed the winners. I don't. Probability is a fine analytical tool for predicting and summarizing counts, but lacks the finality (subject to legal complications) of a count. Flipping a coin, while a decent alternative in close cases, is not something that will finish Presidential races in the US in our lifetime.
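For anyone who wants to rebuild the stats table, here is one way it might be filled, using Python's sqlite3 module. The post does not show how threesig was computed; this sketch assumes 3*sqrt(dem+rep)/2, the per-state version of the national calculation. The three rows are the 2000 major-party totals as I recall them being commonly reported, so treat them as an assumption and check them against an official source.

```python
import math
import sqlite3

# (state, Democrat votes, Republican votes): assumed figures, see lead-in.
rows = [
    ("Florida",    2912253, 2912790),
    ("New Mexico",  286783,  286417),
    ("Iowa",        638517,  634373),
]

conn = sqlite3.connect(":memory:")
# Columns are declared real so that (dem+rep)/2 is not integer division.
conn.execute("create table stats (state text, dem real, rep real, threesig real)")
for state, dem, rep in rows:
    threesig = 3 * math.sqrt(dem + rep) / 2   # 3 sigma for a fair coin, n = dem + rep
    conn.execute("insert into stats values (?, ?, ?, ?)", (state, dem, rep, threesig))

# States whose Democrat count is within 3 sigma of an even split.
within = [s for (s,) in conn.execute(
    "select state from stats where abs((dem+rep)/2 - dem) < threesig order by state")]

# Distance from an even split, in units of threesig (the 'sigs' column).
sigs = dict(conn.execute(
    "select state, abs((dem+rep)/2 - dem)/threesig from stats"))

print(within)
for state, value in sorted(sigs.items()):
    print(f"{state}|{value}")
```

With these inputs the three ratios agree with the sqlite output above to several decimal places, which suggests the threesig assumption is at least close to what was used.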
February 11th, 2005 06:38 am: Lo Technicalities on GIS and other Summaries of Websites
I am part of the support group for the EPA's NSDI node, which provides the motivation for making occasional maps. There are Enterprise tools (ESRI and others) and open source ones (e.g. UMN map server; GRASS) that can help a lot. Those tools have some overhead to master/cost to pay. I thought I would do the TriBlogConf flyer with lo-tech publicly available resources. In principle, for small projects, this means anybody with an internet connection and some patience can follow along.
It is surprisingly difficult to find a legally usable World Map. The CIA provides a couple, which are public domain but a little detailed for my purposes. I used the Blank Map at Wikipedia; since it is licensed under the GFDL the maps below are too. Use and modify but don't stop others modifying your copy.
Finding lat-longs from place names (geo-coding) is possible at http://www.astro.com/atlas or http://www.mapquest.com/maps/main.adp?formtype=latlong . Mapquest can also do business lookups at http://www.mapquest.com/maps/main.adp?formtype=search. I used Google to find the businesses mentioned, and their 'about', 'jobs' or 'investor relations' pages to find a physical address. A couple of sites were exceedingly coy about their snail address.
The rest was at least in principle entirely doable by hand. I made a file of names and urls to track things. I used Perl, curl and sqlite for convenience, but I'm comfortable with these and had a time budget. Anything much larger (say thousands of locations) would require automation of some similar sort. I suppose you could twist (MS|Open)Office to provide the same general level of automation as the above canonical Open Source toolkit, but that is, perhaps, best left for folk doing Office suite automation anyway.
I edited the blank map by hand in: Photoshop LE (came with my kid's Xmas Wacom tablet: fast and solid, not over-featured for this), an out-of-date KDE photo editor (too slow) and the GIMP (fast and solid, if quirky).
In a testament to .png interop, I found no problems as I moved the image in progress among the various programs.
Finding the Average Name was a hoot. I got Perl to slice up names into characters and calculate essentially the offset from 'a'. SQL has an 'avg' function, and so something like
select firstorlast,pos,avg(charnum) from avgnames group by firstorlast,pos order by firstorlast,pos
did most of the work.
I made the flyer as a web page in Mozilla Edit Page. There is surprisingly wide variation in print formatting between browsers (I used Mozilla, Firefox and Safari). Nearly enough of a bother to can the idea and go with the Office Suite (MS|Open) alternative.
A large amount of the time spent on the project was spent estimating where to put dots on the map from map views of the appropriate lat-longs. Real GIS software integrates this well and could have converted this part to a triviality. Real GIS systems also have hooks to do geocoding in an integrated fashion, which would have taken another large chunk out of the time spent.
Getting images into livejournal seems to be done with Flickr or buzznet: I used buzznet.
Update Feb 2006: Moved images since buzznet no longer resolved for me.
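The Average Name calculation was done with Perl and SQL; the same idea fits in a few lines of Python. This is a sketch, not the original script: the function name average_name is mine, it handles one name part (first or last) at a time, and skipping positions that a shorter name does not have is my assumption about how uneven lengths were treated.

```python
def average_name(names):
    """Average the letters of `names` position by position.

    Each letter becomes its offset from 'a' (a=0 ... z=25). Positions past
    the end of a shorter name are skipped, so late positions average over
    fewer names (an assumption, see the lead-in).
    """
    sums, counts = [], []
    for name in names:
        letters = [c for c in name.lower() if "a" <= c <= "z"]
        for pos, c in enumerate(letters):
            if pos == len(sums):          # first name to reach this position
                sums.append(0)
                counts.append(0)
            sums[pos] += ord(c) - ord("a")
            counts[pos] += 1
    avg = "".join(chr(ord("a") + round(s / c)) for s, c in zip(sums, counts))
    return avg.capitalize()

print(average_name(["Abe", "Cdg"]))
print(average_name(["Al", "Cnz"]))
```

Run once over first names and once over last names to get the two halves of the official Average Name.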
06:11 am: TriBlogConf Flyer
I gathered the websites mentioned in the conference page as of February 2, then read the first post and page top for mentions of places. I found the lat-longs of these places, and calculated an average: the red dot in the 'small map' below. Triangle bloggers talk more American East Coast than Mid-Atlantic. I can show more detail in the map above since nobody mentioned Russia, the Malay-Indonesian archipelago, China, India, South America, Australia or almost any part of Africa. Size of dots maps to number of mentions. Countries or states were mapped to their capital, so EU is represented by Brussels and Vermont by Montpelier. The Triangle is the most popular area, big cities in the East Coast follow, then Google, Apple, and Microsoft. I averaged names of participants: the official Average Name for the Conference is Jill Mjmlkm.
Disclosure: this is in part a side-effect of my work responsibilities at CSC.
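The post does not say how the average location was computed; the simplest reading is a plain arithmetic mean of the lat-longs, which could look like the sketch below. The function name and the sample coordinates (approximate values for Raleigh, Brussels and Montpelier) are my own illustration.

```python
def average_latlong(places):
    """Plain arithmetic mean of (lat, lon) pairs in decimal degrees.

    Good enough for a flyer-sized map; it ignores the Earth's curvature and
    misbehaves for points on both sides of the 180th meridian.
    """
    lats = [lat for lat, _ in places]
    lons = [lon for _, lon in places]
    return sum(lats) / len(lats), sum(lons) / len(lons)

# Approximate coordinates (assumed): Raleigh NC, Brussels, Montpelier VT.
mentions = [(35.78, -78.64), (50.85, 4.35), (44.26, -72.58)]
center = average_latlong(mentions)
print(center)
```

To weight by number of mentions, repeat a place in the list once per mention.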
January 28th, 2005 03:15 pm: Runtimeaccess vs REST/xmlrpc - Other ways of accessing a running program
Folks at http://www.runtimeaccess.com/ want to make network services more inspectable, using postgresql's protocol as the network glue.
Of course, transparency of inspection for running programs is important for debugging them, hence the full transparency for single-machine applications implied by, for example, gdb, gud-mode in emacs (covers Perl, Python, etc) and the typical add-on for IDEs. Not surprisingly, these environments typically assume single-language environments and are more rarely implemented cross-network.
Runtimeaccess's paper talks about two things of interest here:
- Suggests that native data structures should be inspected as if they were relational tables
- Provides simple access to the postgresql network layer and some SQL access
It is possible to go whole-hog with the relational table approach and keep all application state in real (in-memory) relational tables. One suggestion for sqlite is, in fact, to use a real embeddable database for significant parts of application state. However, running programs commonly generate tree-like structures (whether they should or not).
Runtimeaccess' thesis is that a useful part of many applications' state is expressible as arrays-of-structs, which can be routinely matched to relational tables, and that postgresql's protocol can take care of many details of network exposure and query language.
Since I work with web sites, I would prefer HTTP as my network protocol, and although I like relational databases, decreeing that the core data structures of a complex application are like normalized relational tables seems a little excessive. If we want to expose an extensive subset of the usual recursive data structures in a language, we can use XMLRPC's encodings to create XML documents from data structures. XMLRPC has a particular view of how HTTP is to be used to move these rigidly encoded data structures, which works well for half the uses (statistics gathering) that one wants to do. XMLRPC also has a view about what to do with the consumed XML responses (translate them immediately into native data structures of the requester) which does not map well to the native query language in XML (xpath). Finally, there is no default subset of XMLRPC devoted to partial update (UPDATE in sql) and creation (sql's INSERT). Xupdate is the simplest xml-native way of doing this in the XML world, and is adequate for many cases.
To summarize:
- We want a concise query and update language for internal data structures; XML has good support for this on complex XML trees; many programming environments lack this (e.g. C's lack of reflection).
- We need to map large tree-like data structures into XML: XMLRPC's serialization rules handle a lot of this well.
- HTTP is an adequate network access protocol; heavily supported in most environments.
- Updates and queries are probably best handled uniformly by the XML machinery associated with xpath.
This suggests that we rely on HTTP as our network access protocol, use XMLRPC serialization to map from data structures to the XML on the wire and use XML tools for the query language and update protocol.
Sounds doable.
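A small sketch of the middle two pieces, using only Python's standard library: xmlrpc.client.dumps does the XMLRPC serialization and xml.etree.ElementTree runs an XPath-style query on the result. The state dictionary is hypothetical, the HTTP layer and any update mechanism are left out, and ElementTree supports only a subset of XPath, so a fuller XPath engine may be needed for real use.

```python
import xml.etree.ElementTree as ET
import xmlrpc.client

# Hypothetical application state: an array of structs nested inside a
# larger tree-like structure.
state = {
    "uptime": 3600,
    "workers": [
        {"id": 1, "requests": 10},
        {"id": 2, "requests": 32},
    ],
}

# XMLRPC serialization. dumps() takes a tuple of parameters; wrapping the
# state as a method response yields one self-contained XML document.
wire_xml = xmlrpc.client.dumps((state,), methodresponse=True)

# Query the XML directly instead of decoding it back to native structures:
# every <int> value belonging to a struct member named 'requests'.
root = ET.fromstring(wire_xml)
requests = [int(e.text) for e in
            root.findall(".//member[name='requests']/value/int")]

print(requests)
```

The same document could be served over HTTP as-is, with the query applied on either end of the wire.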
January 21st, 2005 09:25 am: Redacted pdfs: Xerox and Cut, don't Blacklight
I was looking at the redacted report on the Iraq war and am inspired to describe an effective manual procedure for redaction on public pdf documents.
The problem: what you black out may still be seen.
The pdf format has lots of room for hidden data, so even if Adobe Acrobat can't/won't see this data, some geek can pull it out of the document directly. Further, some methods of redaction can leave residual information available even from the visible form.
Solutions:
- Commercial solutions should work in most cases; the developers should be able to find all the places in the flexible pdf format where the relevant information is hidden by common pdf creation software and remove the equivalents to the redacted pieces. A google search on 'redaction pdf' is enlightening. For extra assurance check with the provider that they have analyzed the version of software that made your pdf.
- Moving pdf documents through paper leads to at least one robust manual procedural technique.
A reasonable paper-based procedure for redaction goes as follows:
- Print the pdf on good quality paper at a good resolution (office laser printing will do in a pinch).
- Cut the areas to be redacted out, with a sharp knife. Do not black them out with a dark pen. Use tape to fix any structural problems this leaves you with.
- Scan the resultant redacted paper back into pdf, review it and publish when you are satisfied that you really have redacted all the pieces you wanted to.
- If you need to merge a redacted text version of the document (for accessibility for example), redact by hand and merge with your generated pdf. Check that the redactions match, of course, before doing this.
Why do we print the pdf? Because unless the application or printer software is severely compromised, the printed, now source, document does not have any representation of the original other than the human-readable visual image. A severely compromised bit of application or printer software could get around this by, for example, adding very small dot patterns to the page that encode the full ascii text of the pdf. This technique is related to steganography and watermarking, but is probably not used in current commercial printer software.
Why do we cut rather than highlight? Because we don't want the redacted print available for view in any form. For example, highlighting in black merely compresses the range of values of darkness we use for reading so that the white background looks very similar to the dark text. But an attacker can get his computer to re-scale these back and see a (probably fuzzy) version of the original.
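The re-scaling attack is not exotic. The sketch below does a linear contrast stretch on a made-up row of grey values; the numbers are invented to illustrate the idea (marker ink slightly lighter than the toner beneath it) and are not measured from any real scan.

```python
def stretch(pixels, out_max=255):
    """Linearly rescale grey values so the darkest maps to 0, the lightest to out_max."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:                      # truly uniform: nothing to recover
        return [0] * len(pixels)
    return [round((p - lo) * out_max / (hi - lo)) for p in pixels]

# Hypothetical scan line through a blacked-out word, on a 0..255 scale:
# marker ink reads as 3..5, toner under the ink as 0..1. To the eye,
# all of it is black.
scanline = [4, 4, 0, 1, 4, 3, 0, 0, 5, 4]
recovered = stretch(scanline)
print(recovered)
```

After stretching, the values that were 0 to 5 span 0 to 255, so the letter strokes stand out from the ink around them. Cutting the paper leaves a uniform hole, which is the case where stretch has nothing to work with.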
This is an example of expensive-but-effective paranoia. My favourite part of the practice is that simple manual procedures can provably invalidate the highest-tech attacks on a fundamental pdf weakness. And that the only ways around it involve extra cost (and probably government coercion) for the application or printer driver manufacturers.
January 13th, 2005 05:14 pm: What to do when your cat does not like you
Choices:
- Avoid the cat.
- Heal the scratches.
Cats as metaphor for social control would lead one to believe that the options are prevention, as we appear to be doing post 9/11, or increased hospital capacity.
Obviously cats are not a good metaphor. For example we note the lack of an analogue for police work (partly because most folk know who their cats are, and can identify the culprit from the wound). On the other side, social control does not normally include reactions to purring, nor the threat/response to non-use of a cat box.
The best analytical framework for cats remains unknown.
04:53 pm: RSS implementer's thoughts
Just dumping from a database into RSS.
Not bad as these things go:
- Target is text-oriented, with well-defined tags and surface syntax
- Has support for character-set designation, with native Unicode as back-up
- Has a validator (I used (divinto)Mark's)
- Has an auto-discovery mechanism that is supported by Firefox
I tried to keep close to the blog use-case:
- A few tens of records, limited to the last day or so
- Reverse chronological order by modify time
Problems:
- Needed to use RSS 2.0 and go with author names rather than email addresses (spam is important)
- Had to align cache strategy with RSS updates for a fast response when users hit the link
- Validation is not fully possible with Relax or XSD
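For the curious, a database-to-RSS dump along these lines might look like the following sketch. The records table, its columns and the channel details are placeholders, not my actual schema, and carrying the author name in Dublin Core's dc:creator element is just one common way to avoid publishing email addresses; run the output through a validator before trusting it.

```python
import sqlite3
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

def build_rss(conn, limit=30):
    """Return an RSS 2.0 document for the most recently modified records."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Recent records"         # placeholder
    ET.SubElement(channel, "link").text = "http://example.org/"     # placeholder
    ET.SubElement(channel, "description").text = "Latest changes"   # placeholder
    # Reverse chronological order by modify time, a few tens of records.
    rows = conn.execute(
        "select title, url, author, modified from records "
        "order by modified desc limit ?", (limit,))
    for title, url, author, modified in rows:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = url
        # Author name rather than email address (assumed use of dc:creator).
        ET.SubElement(item, f"{{{DC}}}creator").text = author
        when = datetime.fromtimestamp(modified, tz=timezone.utc)
        ET.SubElement(item, "pubDate").text = format_datetime(when)
    return ET.tostring(rss, encoding="unicode")

# Hypothetical table: modify time stored as seconds since the Unix epoch.
conn = sqlite3.connect(":memory:")
conn.execute("create table records (title, url, author, modified integer)")
conn.executemany("insert into records values (?, ?, ?, ?)", [
    ("Older", "http://example.org/1", "A. Writer", 1105635600),
    ("Newer", "http://example.org/2", "B. Writer", 1105639200),
])
feed = build_rss(conn)
print(feed)
```

Caching the generated string and rebuilding it only when the newest modified value changes is one way to line up the cache with feed updates.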