Reply to comment
Refactoring Humanity
Submitted by bingomanatee on 5 February, 2009 - 10:02One common tactic in developing any sort of CRM or database reflecting users is to split peoples records into roles. I just got back from a SugarCRM conference (awesome -- thanks guys!) and even SugarCRM, one of the best written applications I've ever been able to poke around, they take this tactic. It has implications when it comes to address recording, duplicate identities and data efficiency that multiplies development where refactoring should be easy and intuitive.

The Sugar Model of Humanity
Sugar takes the usual tactic of dividing records by role: there are users, leads, targets/prospects, and contacts, each stored in a separate table and each requiring a separate object model. The Sugar system uses custom fields, but of course each of these tables has its own set of custom fields. Duplicate checking is pretty much "DIY": the system will give you a list of similar records by a set of key fields (such as e-mail) and from there you can manually "munge" the duplicates row by row.
A "flat table" approach to user data requires you to manage lead data in an unnatural and regimented fashion. Duplicate reduction is inherently driven by user judgment; for instance, given two matching addresses for the same person, there is no automatic way to determine which record is "right."
And in fact in many cases they both are. For instance,
- In one context a person might give you their home address, and in another, they might give you their work address.
- Some people also have multiple work addresses, such as traveling sales people.
- Even when you replace one address with another, you may still want to keep the old data around, such as when a person changes jobs; it may help qualify leads for sales to keep track of not only which company employs a person, but where they were working before, and before that.
- Many people have multiple e-mails and subscribe and unsubscribe to campaigns through different mail addresses in order to better manage their mail traffic
Any and all of these situations will tear hell out of a conventional CRM. Unless a careful, cogent model of identity is used, the multiple nature of identity in the modern world will stymie any effort to reconcile the many identities that people spawn.
Humanistic factoring
The principle of humanistic factoring is to have a single record for core and unique information and to attach subtables for role related data like addresses, job meta fields and other potentially duplicated or conflicting information.* The interesting thing to me is that personal-related records seem to be, bizaarely, the last things that schema designers normalize. I am not sure I know why schema designers seem to work this way, but even in otherwise well-normalized systems, the individual identity is rife with duplication and serial fields. Take phone numbers -- almost always, and even in situations where the phone numbers are sub-fielded into up to three parts (prefix/suffix/extension), they are run serially, even though individuals tend to have a random association of numbers (home/work/cell/fax/admin/900...) that tend to be a fairly empty matrix. Likewise, people tend to have multiple emails. And of course everyone who does not work at home has at least two addresses. Many of these fields, of course, can be associated with impersonal artifacts (companies, branches, etc.) for further optimization. And of course, though we all work in at least one business, many of us have links to several entities and in many cases, multiple jobs. In sum, every person has several one to many associations that compound their true identity; denormalizing these is a bad way to represent a person's real information cloud.
The Human Record
So given that, what does belong in the individuals core table? The human record is compact and has for the most part information that is definitively unique and unchanging.
- Name/title fields,
- Birthday
- Government ID (Social Security number in the US
- username and password
- Primary email**
This record is kept in a table called "Humans" or "People" whose identity is referred to externally as "human_id".***
Identity Metadata
Here is a short list of the tables that associate into the person(though they may associate with other records (companies, etc.) as well):
- Address (with qualifier - home/work/etc.)
- Website
- phone_fax
- Role (job, join to company)
- Access_grant
The fields of these tables should be fairly self-evident, and will probably be custom for each application. Whether they are foreign-keyed to an individual or HABTM joined to individuals, companies, etc. depends on the nature of the application and the level of concern for duplicate recordkeeping.
Denormalizing Humanity
As long as original data is maintained in normalized fashion, there is no reason not to flatten people out in optimized tables with serialized fields for home / work address, 3 or 4 typical role e-mails, etc. These flat records can be used for listing people in displays; however they are not the true system of record but merely a display convenience. When it comes to rendering tickets, search indexing, etc., the primary tables will still be the reference to use.
* I am omitting the ubiquitous "note" or "comment" fields in all the models below for clarity's sake, as well as fields for "created", "entrant_id" and "changed" which you'd want to include for auditing purposes in most data.
The Sugar convention of changing a "deleted" field to "1" instead of instantly deleting records is a good practice and any of these tables could follow that convention as well. This is not the same as the "status" field of the human_contact field: "bad" and "old" data would be both persistent and visible through the UI to help reconcile and qualify any new or conflicting data for that human.
** This doesn't follow the pattern of "Unique and singular" in that one person can have multiple e-mails but it is useful from a database efficiency point of view to have it in the primary human table.
*** (with a world population of 6,000,000, an unsigned integer should serve in most cases or you can follow the Sugar model of a time stamp-defendant ID for a replication safe record)
**** replaces_identity allows you to set up an "audit chain" by saving a new record with each update.
***** The status field is vital for tracking reconciliation of dumped data. A separate field for email_status might also be useful as that can be independently corroborated atomically, where address data is a bit more work.
