Monday, January 27, 2025

First Foundation Model Dataset Listed on Etsy

I wanted to get something out quickly, and mainly for the record.

I'd hoped to list my curated Foundation Model Datasets (FMDs) on my own website, but that'll take longer than I want to wait to get this thing going.  So instead of fiddling and fighting to get some kind of payment thing going on my own site, I've begun posting FMDs on my Etsy store.  I can, in all honesty, call this artwork -- so it fits right in on a site like Etsy.


My first offerings will be the transcripts of my ADL livestreams.  I'm still the owner of this content (including video and transcripts) and can do with it as I please (ahem, so can YouTube).  I'm still working on the price, but the initial price will be $7.77 for one dataset, which usually contains about 10,000 total words, of which 10-20% are unique.  LOL gotta find buyers who'll say 'No' so I can make the necessary price adjustments.

For each FMD, I've copied the transcript from YT and edited it down to just the text, without any other kind of markup.  I then make a few measurements of the text and present them as a JSON-formatted file that I use in the Description of the item for sale.  The format of the JSON file is:

{
    "dataset_metadata": {
        "type": "text_transcript_analysis",
        "version": "1.0",
        "purpose": "Language Model Training Features"
    },
    "word_level_features": {
        "total_word_count": "integer",
        "unique_word_count": "integer", 
        "average_word_length": "float",
        "vocabulary_diversity": {
            "type_token_ratio": "float"
        },
        "word_frequency_distribution": {
            "top_n_words": ["string"],
            "word_frequency_curve": "array_of_floats"
        }
    },
    "letter_level_features": {
        "total_letter_count": "integer",
        "case_distribution": {
            "uppercase_percentage": "float", 
            "lowercase_percentage": "float"
        },
        "letter_frequency": {
            "absolute_counts": {"a-z": "integer"},
            "relative_frequencies": {"a-z": "float"}
        }
    },
    "complexity_metrics": {
        "word_length_distribution": "array_of_integers",
        "entropy_measures": {
            "word_entropy": "float",
            "letter_entropy": "float"
        }
    }
}
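
For reference, here is a minimal Python sketch of how measurements in this format could be computed from a plain-text transcript. It is only an illustration, not necessarily the script behind the listed datasets. The names measure_transcript and shannon_entropy are made up for this example, and several details are assumptions: what counts as a word, Shannon entropy in bits for the two entropy fields, a rank-ordered list of relative frequencies for "word_frequency_curve", and a length-indexed list for "word_length_distribution".

```python
import json
import math
import re
from collections import Counter
from string import ascii_lowercase


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a table of counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)


def measure_transcript(text, top_n=10):
    """Fill in the JSON layout above for one plain-text transcript."""
    # Assumption: a word is a run of ASCII letters, optionally with one
    # internal apostrophe ("that'll"); words are lowercased before counting.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    word_counts = Counter(words)
    n_words = len(words)

    # Assumption: only ASCII letters (a-z, A-Z) count as letters.
    letters = [ch for ch in text if ch.isascii() and ch.isalpha()]
    n_letters = len(letters)
    n_upper = sum(ch.isupper() for ch in letters)
    letter_counts = Counter(ch.lower() for ch in letters)

    # Assumption: index i holds the number of words of length i + 1
    # (an apostrophe inside a word counts toward its length).
    lengths = Counter(len(w) for w in words)
    length_dist = [lengths.get(i, 0) for i in range(1, max(lengths, default=0) + 1)]

    return {
        "dataset_metadata": {
            "type": "text_transcript_analysis",
            "version": "1.0",
            "purpose": "Language Model Training Features",
        },
        "word_level_features": {
            "total_word_count": n_words,
            "unique_word_count": len(word_counts),
            "average_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
            "vocabulary_diversity": {
                "type_token_ratio": len(word_counts) / n_words if n_words else 0.0,
            },
            "word_frequency_distribution": {
                "top_n_words": [w for w, _ in word_counts.most_common(top_n)],
                # Assumption: relative frequencies, most common word first.
                "word_frequency_curve": [c / n_words for _, c in word_counts.most_common()],
            },
        },
        "letter_level_features": {
            "total_letter_count": n_letters,
            "case_distribution": {
                "uppercase_percentage": 100.0 * n_upper / n_letters if n_letters else 0.0,
                "lowercase_percentage": 100.0 * (n_letters - n_upper) / n_letters if n_letters else 0.0,
            },
            "letter_frequency": {
                "absolute_counts": {ch: letter_counts.get(ch, 0) for ch in ascii_lowercase},
                "relative_frequencies": {
                    ch: letter_counts.get(ch, 0) / n_letters if n_letters else 0.0
                    for ch in ascii_lowercase
                },
            },
        },
        "complexity_metrics": {
            "word_length_distribution": length_dist,
            "entropy_measures": {
                "word_entropy": shannon_entropy(word_counts),
                "letter_entropy": shannon_entropy(letter_counts),
            },
        },
    }
```

Calling json.dumps(measure_transcript(text), indent=4) on a transcript string gives text that can be pasted into an item description. With this definition, "type_token_ratio" is the unique-word fraction mentioned earlier, so a value between 0.10 and 0.20 corresponds to 10-20% unique words.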

At first, these datasets will be best used in the supervised (using any of the above metadata as labels) and unsupervised training of models such as Large Language Models.

Saturday, January 25, 2025

Labeled Data Is All We Need

(I realize that this has very little to do directly with astronomy or music, but I wanted to post this so there's a record. I should also mention that the text below was generated by claude.ai, with substantial prompting from me.  It reflects my sentiments very well.)

Preface: Looking Forward – Short-Term

In the coming days, I will establish a website where Foundation Models (FMs) can purchase and utilize data I am personally recording and curating. The labeled datasets will match the quality and format of those available from established sources like Hugging Face and AWS. Upon purchase, buyers will receive non-exclusive rights to use the data for training and testing purposes. New data will be continuously added to support ongoing Foundation Model development. Pricing details are currently to be determined.

While selling datasets to Foundation Models is not a novel concept, I observe a lack of individual contributors and a scarcity of unique, high-quality data. Though I often find myself ahead of technological trends, this particular initiative seems somewhat more aligned with current market movements.

I've long recognized my tendency to conceptualize ideas well before their widespread adoption. Typically, my innovative thinking outpaces my resources to fully develop these concepts. In this instance, however, I feel more synchronized with emerging industry trends. Comments from thought leaders like Ilya Sutskever and the recent Stargate Project announcement have validated my approach.

The Stargate Project's proposed $500 billion budget (approximately $342 million daily, if spent evenly over the announced four years) intrigued me. I contemplated their potential expenditures: power infrastructure (potentially nuclear), computational resources for model training and prediction generation, and data storage. The critical question emerged: Why massive storage if, as Ilya suggests, no new data exists?

The answer became clear: in the same talk in which Ilya said that we were out of data, he also said that "data is the fossil fuel of AI".  I would put it another way, but that's the basic idea.  Foundation Models require substantial new labeled datasets from high-quality sources. My work directly addresses this emerging market need.

This whole idea has been compelling me for about a year. It crystallized after hearing the Stargate Project announcement, making me realize the potential market for carefully curated, high-quality data.

My professional focus will now be collecting and curating labeled data for Foundation Models.

Other major work that I'm doing -- astronomical research and advanced robotics -- will continue.  No stopping either of these from happening.