Thursday, January 30, 2025

Data For Foundation Models

It turns out that while I do retain ownership of the transcripts of my YouTube livestreams, there's nothing I can do with them without causing trouble.  This is from YT's TOS:


So -- onward.  I didn't expect that to go very far.  I'll keep the listings up on Etsy for a while, but I'll soon be able to curate labeled datasets.  Text is still an excellent data source, but harvesting and labeling it is challenging.

The next kind of data that can be fairly easily harvested and curated is labeled images.  This is what I've turned my attention to.  The task at hand is not only to do the harvesting, but also the post-processing that packages the data up with excellent labeling.

The labeling will initially consist of the usual statistics on the data (min, max, mean, median, mode, standard deviation), and also an 'information content' parameter (entropy).  There will also be text labels that describe the image data in English.
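Here is a minimal sketch of how those numeric labels could be computed for one image.  It assumes the image has already been flattened into a list of 8-bit grayscale values, and it assumes 'information content' means the Shannon entropy of the pixel-value histogram, which may not be the exact definition used for the plots in this post.  The function names and the toy pixel values are made up for illustration.

```python
import math
import statistics
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    counts = Counter(values)
    total = len(values)
    # sum of p * log2(1/p) over every distinct value
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def describe_pixels(pixels):
    """Summary statistics for a flat list of pixel intensities."""
    return {
        "min": min(pixels),
        "max": max(pixels),
        "mean": statistics.mean(pixels),
        "median": statistics.median(pixels),
        "mode": statistics.mode(pixels),
        "std": statistics.pstdev(pixels),  # population standard deviation
        "entropy_bits": shannon_entropy(pixels),
    }


# Hypothetical 2x2 image, flattened (not real data from the video)
pixels = [10, 10, 20, 30]
labels = describe_pixels(pixels)
```

The resulting dictionary can be stored next to the image file (for example as JSON) so the numeric labels travel with the data.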

Initially I'll package up labeled data into datasets which will be available for purchase on my website (yes, I'm off of Etsy).  I can see a time pretty soon when there will be 'made-to-order' datasets based on various parameters set by the client.  For example, a dataset that contains high-quality images of various red balls.

----------------

This plot shows the 'information content' (y-axis, in bits) of the individual frames of a 66.31-second video (x-axis, time in seconds) of a partly cloudy sky.  The color of each line indicates the most common color in that frame.  That's a very nice sky blue, and a frame contains more information when there's a cloud in it (like near the beginning).  An interesting rabbit hole, but this is one of the ways I'll 'label' this data -- by information content:


Here's a single image about mid-way through the video.

(gotta *love* that Arizona sky!!!)


The labels for this set of images look pretty good:

cumulus_clouds, azure_blue_sky, daytime_outdoor, white_cloud_formations, scattered_cloud_pattern, high_contrast_sky, natural_lighting, wispy_cloud_texture, atmospheric_scene, clear_weather_conditions, cloud_movement_visible, bright_sunlit_clouds, cirrus_cloud_wisps, deep_blue_atmosphere, horizontal_cloud_composition, mid_day_lighting, weather_photography, dynamic_cloud_shapes, stratified_cloud_layers, negative_space_composition

Information content in bits (y-axis) for the H (hue, top plot), S (saturation, middle), and V (value, bottom) channels of these color images.  Note that the scale and values of the y-axes are different for each plot.  The x-axis is still time in seconds.  Wow, what a rabbit hole!
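A rough sketch of the per-channel calculation, for anyone who wants to try it.  It assumes each frame is a list of 8-bit (r, g, b) tuples and that each HSV channel is quantized to 256 levels before counting; both the bin count and the function names are my assumptions here, not necessarily what produced the plots.

```python
import colorsys
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def hsv_entropies(rgb_pixels, levels=256):
    """Entropy of the H, S, and V channels of a list of 8-bit (r, g, b) tuples."""
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in rgb_pixels]
    result = {}
    for name, channel in zip("HSV", zip(*hsv)):
        # Quantize each channel from [0, 1] to integer levels before counting.
        quantized = [min(int(x * levels), levels - 1) for x in channel]
        result[name] = entropy_bits(quantized)
    return result


# Hypothetical toy frame: half sky blue, half white cloud
frame = [(135, 206, 235)] * 4 + [(255, 255, 255)] * 4
channel_entropy = hsv_entropies(frame)
```

Running this over every frame of a video and plotting each channel against time would give three curves like the ones described above.  The pure-Python loop is slow for full-size frames, so a real pipeline would likely use an array library instead.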




In this plot, there is the same number of images for each color (1/3 of the total).  Blue is the top third by information content, red is the middle third, and black is the bottom third.  This keeps the size of each section (in this case, the number of images) the same, which might be useful to someone carefully training or testing a foundation model.
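The equal-size split can be sketched by ranking frames on entropy and cutting the ranking into thirds.  The function name and the sample entropy values below are hypothetical; when the frame count isn't divisible by three, the group sizes differ by at most one.

```python
def split_into_thirds(entropies):
    """Group frame indices into bottom, middle, and top thirds by entropy rank."""
    # Frame indices sorted from lowest to highest entropy
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    n = len(order)
    cut1, cut2 = n // 3, 2 * n // 3
    return {
        "bottom": order[:cut1],      # black in the plot
        "middle": order[cut1:cut2],  # red
        "top": order[cut2:],         # blue
    }


# Made-up per-frame entropies in bits (not measurements from the video)
entropies = [5.1, 7.3, 6.0, 4.2, 7.9, 6.4]
groups = split_into_thirds(entropies)
```

Because the cut is by rank rather than by fixed entropy thresholds, each group always holds about the same number of images no matter how the entropy values are distributed.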


There are many ways to organize and extract this data for training and testing foundation models.
