J. M. Haile
08–14 December 2009
We identify four classes of objects that should take standard forms in the Vault: genres, titles, versions, and tags. One or two
additional classes may be identified as development proceeds. The rules below are likely incomplete; we expect
them to be refined during development and with use.
While users will be encouraged to add to or edit much of the information on Vault record pages, users will not be able to directly
add to or edit genres, titles, versions, or tags on Vault record pages. Users can influence these classes on Vault pages by editing the corresponding fields
on the user record pages.
1. Genres in Standard Form
Definition. The Take11 standard form for genres will have the following characteristics:
- (a) Generally, a genre will be a single word. In a few cases, words or terms may be joined
by a hyphen or slash. At this point the only exception, a genre written as two separate words, is "Live Performance."
- (b) The first letter of a genre will always be capitalized and the remainder of the word will be lower case.
- (c) Multiple genres can be assigned to one film; e.g., a romantic comedy can be identified by assigning
the genres "Romance" and "Comedy."
- (d) The allowed genres are restricted by the software; however, users can ask the developers to add
new genres.
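The genre rules above can be sketched in a few lines of Python. This is only an illustration: the names `ALLOWED_GENRES`, `standardize_genre`, and `standardize_genres` are hypothetical, and the allowed list shown is a placeholder, not the actual list enforced by the software.

```python
# Hypothetical allowed list; the real list is maintained by the developers.
ALLOWED_GENRES = {"Comedy", "Romance", "Drama", "Live Performance"}


def standardize_genre(raw, allowed=ALLOWED_GENRES):
    """Return the standard form of one genre, or None if it is not allowed."""
    # Rule (b): first letter upper case, remainder lower case (applied per word,
    # so that the two-word exception "Live Performance" is handled too).
    candidate = " ".join(word.capitalize() for word in raw.split())
    # Rule (d): only genres on the allowed list are accepted.
    return candidate if candidate in allowed else None


def standardize_genres(raw_genres, allowed=ALLOWED_GENRES):
    """Rule (c): a film may carry several genres; keep each allowed one once."""
    result = []
    for raw in raw_genres:
        genre = standardize_genre(raw, allowed)
        if genre is not None and genre not in result:
            result.append(genre)
    return result
```

Under these assumptions, a romantic comedy entered as `["romance", "COMEDY"]` would be stored as `["Romance", "Comedy"]`, and a genre missing from the list would be dropped until the developers add it.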
2. Movie Titles in Standard Form
Definition. The Take11 standard form for titles will be limited to one of the following forms:
- (1) The Title
- (2) The Main Title: The Subtitle
- (3) The Title (year)
- (4) The Title (year, director's last name)
These standard forms obey these rules:
- (a) In most cases, a standard title will contain only a main title or a main title and a subtitle (if the original film release had a subtitle).
- (b) Other identifiers, such as versions, collections, and marketing information will not appear in the standard title.
- (c) The main title and subtitle are always and only separated by ": ". No other punctuation is allowed,
unless it is part of the title or subtitle.
- (d) Every word in the standard title starts with an uppercase letter and the remainder of the word is in lowercase.
- (e) When rules (a)–(d) result in the same title for two movies, the two will be judged the same
(perhaps different versions) if their theatrical release years are within N years of one another;
otherwise, they will be judged to be different movies. In preliminary tests, N = 5 seems to work.
- (f) When rule (e) results in the same title for two different movies,
the standard title for each will include the theatrical release year in parentheses.
- (g) When, on applying rule (f), two movies are found to have the same title and the same release year, the two will be judged
the same or different based on the names of their directors. If the directors differ, the standard form for each movie will include the release year and the last name of its director, as in form (4).
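As a rough sketch of how rules (c) through (g) might be applied, the Python below builds a title from its parts and then disambiguates two records that share a title. Everything here is an assumption made for illustration: the `Movie` record, the function names, reading "within N years" as an inclusive bound, and checking the same-year case before the N-year test. Note also that rule (d), applied literally as below, would lowercase acronyms inside a title, so a real script would probably need an exceptions list.

```python
from dataclasses import dataclass


@dataclass
class Movie:
    title: str     # title already reduced by rules (a)-(d)
    year: int      # theatrical release year
    director: str  # director's last name


def capitalize_words(text):
    """Rule (d): each word starts upper case, the remainder is lower case."""
    return " ".join(word.capitalize() for word in text.split())


def standard_title(main, subtitle=None):
    """Rule (c): main title and subtitle are separated by ': ' and nothing else."""
    title = capitalize_words(main)
    if subtitle:
        title += ": " + capitalize_words(subtitle)
    return title


def disambiguate(a, b, n_years=5):
    """Return the standard titles for two records that share a base title."""
    if a.title != b.title:
        raise ValueError("records must share the same base title")
    if a.year == b.year:
        # Rule (g): same title and year, so compare directors.
        if a.director != b.director:
            return (f"{a.title} ({a.year}, {a.director})",
                    f"{b.title} ({b.year}, {b.director})")
        return (a.title, b.title)
    if abs(a.year - b.year) <= n_years:
        # Rule (e): close release years, judged the same movie.
        return (a.title, b.title)
    # Rule (f): different movies, so each title carries its release year.
    return (f"{a.title} ({a.year})", f"{b.title} ({b.year})")
```

For example, `standard_title("2001", "a space odyssey")` gives `"2001: A Space Odyssey"`, and two "Sabrina" records dated 1953 and 1995 come back as `"Sabrina (1953)"` and `"Sabrina (1995)"` because the years are more than N = 5 apart.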
Implementation. Every title cataloged in the user database will be reduced to its standard form. All such reductions will be performed
by server-side scripts, not by individual users of the site. Users are encouraged to edit their records to use the standard form, but this is not
required.
However, there are many records in the user database for which there is insufficient information to apply all the above rules. For example, many records
may be missing the theatrical release year or the name of the director or both. In such cases, the scripts will attempt to resolve the issue by using
secondary information to compare a problematic title to well-defined ones.
- (a) If the problem record contains little or no secondary information, the title will be assigned the
standard form of the most-cataloged title that is closest to the problem title.
- (b) If secondary data are available, the precedence order for comparisons will be (i) EANs, (ii) UPCs, (iii) actor names.
- (c) If secondary data are still insufficient to resolve the problem, we will throw a "disambiguation" page (horrid word)
whenever a user asks for that title to be loaded into the Vault record page. The user will be asked to provide the release year or director's name, which
the scripts can then use to resolve the issue. Note that this is not the same as asking a user to decide whether two movies are the same or different:
users would be asked only for concrete data so the software can apply the standard rules.
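The fallback steps (a) through (c) might look something like the following Python sketch. It is not the server-side script itself: the record layout (dictionaries with `title`, `ean`, `upc`, `actors`, `standard_title`, and `catalog_count` keys) is hypothetical, "closest" is approximated here with the standard library's `difflib.get_close_matches`, and a return value of `None` stands for "show the disambiguation page."

```python
import difflib


def resolve_title(problem, references):
    """Return a standard title for a problem record, or None if unresolved."""
    if not any(problem.get(key) for key in ("ean", "upc", "actors")):
        # Step (a): no secondary data, so take the most-cataloged title
        # among those whose text is close to the problem title.
        by_title = {ref["standard_title"]: ref for ref in references}
        close = difflib.get_close_matches(
            problem["title"], list(by_title), n=5, cutoff=0.6)
        if close:
            return max(close, key=lambda t: by_title[t]["catalog_count"])
        return None

    # Step (b): compare in precedence order (i) EANs, (ii) UPCs, (iii) actors.
    for key in ("ean", "upc"):
        value = problem.get(key)
        if value:
            for ref in references:
                if ref.get(key) == value:
                    return ref["standard_title"]

    actors = set(problem.get("actors") or [])
    if actors:
        def shared(ref):
            return len(actors & set(ref.get("actors") or []))
        best = max(references, key=shared, default=None)
        if best is not None and shared(best) > 0:
            return best["standard_title"]

    # Step (c): still unresolved; the caller would ask the user for the
    # release year or director's name.
    return None
```

The precedence order is encoded simply by the order of the checks, so an EAN match always wins over a UPC match, which in turn wins over shared actor names.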
The following Table gives a few examples.
| Title in User Catalog | Release Year | Title in Vault Standard Form |
|---|---|---|
| Six days, seven nights | 1998 | Six Days, Seven Nights |
| 2001 – A Space Odyssey | 1968 | 2001: A Space Odyssey |
| Stargate, extended cut | 1993 | Stargate |
| Stargate SG–1: Season 2 | 1997 | Stargate SG–1 |
| Back to the Future complete trilogy | | Back To The Future |
| Sabrina | 1995 | Sabrina (1995) |
| Sabrina | 1953 | Sabrina (1953) |
| Blade Runner | 1982 | Blade Runner (1982) |
| Blade Runner – the Final Cut | 2007 | Blade Runner (2007) |
3. Versions in Standard Form
Standard versions will be all lower case and will be selected from a list of allowed values in their standard forms. Examples include
season 1, special edition, director's cut, and final version. The possible values will be obtained from those appearing in the user
database.
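One way the scripts might pull a standard version out of a raw catalog title is sketched below in Python. The list `ALLOWED_VERSIONS`, the season pattern, and the function name `find_versions` are all hypothetical; the actual allowed values would come from the user database as described above.

```python
import re

# Hypothetical allowed values, all lower case.
ALLOWED_VERSIONS = ["special edition", "director's cut", "final version", "extended cut"]

# Seasons are numbered, so they are matched by pattern rather than listed.
SEASON_PATTERN = re.compile(r"\bseason\s+(\d+)\b")


def find_versions(raw_title):
    """Return the standard versions mentioned in a raw catalog title."""
    lowered = raw_title.lower()
    found = [version for version in ALLOWED_VERSIONS if version in lowered]
    found += [f"season {number}" for number in SEASON_PATTERN.findall(lowered)]
    return found
```

With these placeholder values, "Stargate, extended cut" would yield the version "extended cut", leaving "Stargate" as the title to be standardized.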
4. Tags in Standard Form
Standard tags will generally be all lowercase; exceptions may include names, such as awards (Acad Awd), world wars (WW II), etc.
A table of standard forms (like that for Versions) will also be built to resolve variants of the same tag, such as movie-from-book, based on book, etc.
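A minimal Python sketch of such a variant table follows. The entries in `TAG_VARIANTS` are placeholders, and the choice of "based on book" as the standard form is arbitrary; the real table would be built from tags found in the user database.

```python
# Hypothetical variant table: lower-cased variant -> standard tag.
TAG_VARIANTS = {
    "movie-from-book": "based on book",
    "based on book": "based on book",
    "wwii": "WW II",
    "ww ii": "WW II",
    "world war ii": "WW II",
}


def standardize_tag(raw):
    """Map a user tag to its standard form; unknown tags are just lower-cased."""
    # Normalize case and collapse runs of whitespace before the lookup.
    key = " ".join(raw.lower().split())
    # Known variants map to one standard tag, which may keep capitals (names).
    return TAG_VARIANTS.get(key, key)
```

Because the lookup happens after lower-casing, exceptions such as "WW II" keep their capitals only when the table says so; every other tag falls through in lowercase.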
Copyright © 2012 by J.M. Haile. All rights reserved.