I wanted to get something out quickly, and mainly for the record.
I'd hoped to be able to list my curated Foundation Model Datasets (FMDs) on my own website, but that will take longer than I'm willing to wait to get this thing going. So instead of fiddling and fighting to get some kind of payment system working on my own site, I've begun posting FMDs on my Etsy store. I can, in all honesty, call this artwork -- so it fits right into a site like Etsy.
My first set of offerings will be the transcripts of my ADL livestreams. I'm still the owner of this content (including video and transcripts) and can do with it as I please (ahem, so can YouTube). I'm still working out the price, but the initial price will be $7.77 for one dataset, which usually contains about 10,000 total words and 10-20% unique words. LOL gotta find buyers who'll say 'No' so I can make the necessary price adjustments.
So in the case of the FMDs, I copy the transcript from YT and edit it down to just the text, without any other kind of markup. I then take a few measurements of the text and present them as a JSON-formatted file that I use in the Description of the item for sale. The format of the JSON file is:
{
  "dataset_metadata": {
    "type": "text_transcript_analysis",
    "version": "1.0",
    "purpose": "Language Model Training Features"
  },
  "word_level_features": {
    "total_word_count": "integer",
    "unique_word_count": "integer",
    "average_word_length": "float",
    "vocabulary_diversity": {
      "type_token_ratio": "float"
    },
    "word_frequency_distribution": {
      "top_n_words": ["string"],
      "word_frequency_curve": "array_of_floats"
    }
  },
  "letter_level_features": {
    "total_letter_count": "integer",
    "case_distribution": {
      "uppercase_percentage": "float",
      "lowercase_percentage": "float"
    },
    "letter_frequency": {
      "absolute_counts": {"a-z": "integer"},
      "relative_frequencies": {"a-z": "float"}
    }
  },
  "complexity_metrics": {
    "word_length_distribution": "array_of_integers",
    "entropy_measures": {
      "word_entropy": "float",
      "letter_entropy": "float"
    }
  }
}
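For anyone curious how those numbers get produced, here is a minimal sketch in Python (standard library only) that computes every field in the schema above from a plain-text transcript. This isn't my exact tooling: the file name transcript.txt, the top_n of 20, and the choice of Shannon entropy in bits are assumptions for illustration, and it expects a non-empty transcript.

import json
import math
import re
from collections import Counter

def entropy(counter: Counter) -> float:
    """Shannon entropy (bits) of a frequency distribution."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

def analyze_transcript(text: str, top_n: int = 20) -> dict:
    words = re.findall(r"[A-Za-z']+", text)
    letters = [ch for ch in text if ch.isalpha()]
    word_counts = Counter(w.lower() for w in words)
    letter_counts = Counter(ch.lower() for ch in letters)
    length_counts = Counter(len(w) for w in words)

    total_words = len(words)
    total_letters = len(letters)
    uppercase = sum(1 for ch in letters if ch.isupper())

    return {
        "dataset_metadata": {
            "type": "text_transcript_analysis",
            "version": "1.0",
            "purpose": "Language Model Training Features",
        },
        "word_level_features": {
            "total_word_count": total_words,
            "unique_word_count": len(word_counts),
            "average_word_length": sum(len(w) for w in words) / total_words,
            "vocabulary_diversity": {
                "type_token_ratio": len(word_counts) / total_words,
            },
            "word_frequency_distribution": {
                "top_n_words": [w for w, _ in word_counts.most_common(top_n)],
                # Relative frequency of each distinct word, most common first.
                "word_frequency_curve": [
                    c / total_words for _, c in word_counts.most_common()
                ],
            },
        },
        "letter_level_features": {
            "total_letter_count": total_letters,
            "case_distribution": {
                "uppercase_percentage": 100.0 * uppercase / total_letters,
                "lowercase_percentage": 100.0 * (total_letters - uppercase) / total_letters,
            },
            "letter_frequency": {
                "absolute_counts": dict(sorted(letter_counts.items())),
                "relative_frequencies": {
                    ch: c / total_letters for ch, c in sorted(letter_counts.items())
                },
            },
        },
        "complexity_metrics": {
            # Count of words at each length, from length 1 up to the longest word.
            "word_length_distribution": [
                length_counts[n] for n in range(1, max(length_counts) + 1)
            ],
            "entropy_measures": {
                "word_entropy": entropy(word_counts),
                "letter_entropy": entropy(letter_counts),
            },
        },
    }

if __name__ == "__main__":
    with open("transcript.txt", encoding="utf-8") as f:
        print(json.dumps(analyze_transcript(f.read()), indent=2))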
At first, these datasets will be best used for supervised training (using any of the above metadata as labels) or unsupervised training of models such as Large Language Models.
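As a rough illustration of the "metadata as labels" idea, here's one way a buyer might pair the transcript with a single numeric field from the JSON (word entropy, chosen arbitrarily) to form one supervised training example. The file names and the example structure are my assumptions for the sketch, not a prescribed format.

import json

with open("transcript.txt", encoding="utf-8") as f:
    text = f.read()
with open("transcript_features.json", encoding="utf-8") as f:
    features = json.load(f)

# One (input, label) pair: the raw transcript plus one metric as the target.
example = {
    "input": text,
    "label": features["complexity_metrics"]["entropy_measures"]["word_entropy"],
}

# For unsupervised (e.g. next-token) training, the transcript text alone is the
# training signal and the JSON features can be dropped or kept as metadata.
print(json.dumps({"label": example["label"], "chars": len(example["input"])}))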