In this tutorial, we'll work with the Stanford Dogs dataset, a subset of ImageNet built for fine-grained classification. We'll extract labels from the file paths, map them to breed names for readability, and create train/val/test splits.

Download and set up dataset

I'll use the Kaggle copy of the dataset. You need the Kaggle CLI installed and your kaggle.json file placed in the ~/.kaggle directory (or /root/.kaggle on Colab).
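
Before downloading, it can help to confirm that the credentials file is where the CLI expects it. The snippet below is a minimal sketch: has_kaggle_credentials is a hypothetical helper (not part of the Kaggle CLI), and it only checks that the file exists and contains the username and key fields of an API token.

```python
import json
from pathlib import Path

def has_kaggle_credentials(config_dir=None):
    """Return True if kaggle.json exists in config_dir and has the expected fields.

    Hypothetical helper for illustration; it never contacts Kaggle.
    """
    if config_dir is None:
        config_dir = Path.home() / ".kaggle"
    cred_file = Path(config_dir) / "kaggle.json"
    if not cred_file.is_file():
        return False
    try:
        creds = json.loads(cred_file.read_text())
    except (OSError, ValueError):
        # unreadable file or invalid JSON
        return False
    # An API token downloaded from Kaggle holds a username and a key.
    return isinstance(creds, dict) and {"username", "key"} <= creds.keys()

print(has_kaggle_credentials())
```

The CLI also warns when the file is readable by other users; running chmod 600 ~/.kaggle/kaggle.json restricts it to your account.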

!kaggle datasets download -d jessicali9530/stanford-dogs-dataset -p data
!cd data && mkdir stanford-dogs &&\
 unzip -qq stanford-dogs-dataset.zip -d stanford-dogs &&\
 rm -rf stanford-dogs-dataset.zip
Downloading stanford-dogs-dataset.zip to data
 98% 737M/750M [00:09<00:00, 79.2MB/s]
100% 750M/750M [00:09<00:00, 80.3MB/s]
%cd data
!mv stanford-dogs/annotations/Annotation stanford-dogs/images/Images stanford-dogs
!rm -rf stanford-dogs/annotations stanford-dogs/images
%cd /content
!wget http://vision.stanford.edu/aditya86/ImageNetDogs/lists.tar 
!tar -xf lists.tar test_list.mat && rm -rf lists.tar

Explore Dataset

The goal of this section is to label the images with breed names, split the dataset into train and test sets, and apply K-fold splitting to the train set.

Expected distribution at the end:

Train: 9000

Val: 3000

Test: 8580
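
The train and val numbers follow from holding out one of four equal folds of the 12,000 training images (four is the n_splits value passed to StratifiedKFold later in this tutorial). A quick check of the arithmetic:

```python
n_train_total = 12_000   # images left after removing the test list
n_splits = 4             # number of folds used for cross-validation

val_size = n_train_total // n_splits    # one fold is held out for validation
train_size = n_train_total - val_size   # the remaining folds are used for training

print(train_size, val_size)  # 9000 3000
```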

from fastai2.vision.all import *
from fastai2.data.all import *
from fastprogress import progress_bar
from scipy.io import loadmat
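# test_list.mat holds the official test split; 'file_list' stores paths relative to Images/
test_list = loadmat('test_list.mat')

base = Path('data/stanford-dogs/Images')
# each entry is a nested array, so l[0][0] unwraps it to a plain string
test_fns = [base/l[0][0] for l in test_list['file_list']]
path = Path('data/stanford-dogs')
fnames = get_image_files(path/'Images'); fnames[:5]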

We keep the test images aside for benchmarking and split the remaining images into train/val.

train_fns = list(set(fnames) - set(test_fns))
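
The set difference works because both lists hold Path objects built from the same base directory, so identical paths compare equal. The sketch below shows the idea with made-up file names. The sorted call is an optional addition (not in the code above) that makes the resulting order reproducible, since set iteration order is not guaranteed.

```python
from pathlib import Path

base = Path("data/stanford-dogs/Images")

# Toy file names for illustration only.
all_files = [
    base / "n02085620-Chihuahua" / "n02085620_1.jpg",
    base / "n02085620-Chihuahua" / "n02085620_2.jpg",
    base / "n02110063-malamute" / "n02110063_1.jpg",
    base / "n02110063-malamute" / "n02110063_2.jpg",
]
held_out = [
    base / "n02085620-Chihuahua" / "n02085620_2.jpg",
    base / "n02110063-malamute" / "n02110063_1.jpg",
]

# Everything that is not in the held-out list becomes training data.
train_files = sorted(set(all_files) - set(held_out))

print([f.name for f in train_files])  # ['n02085620_1.jpg', 'n02110063_2.jpg']
```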

Stratified KFold

The following steps are performed:

  1. From all the directory names, we build a vocab of the classes involved
  2. We pass this vocab to CategoryMap to get the o2i (object-to-index) mapping
  3. We write a Pipeline that goes from a full path to its integer label
  4. We proceed with StratifiedKFold, where X is the file names and y is the respective integer labels
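
The steps above can be sketched on toy data without fastai. This is only an illustration under stated assumptions: a plain dict stands in for CategoryMap's o2i, a small function (path_to_index, a hypothetical name) stands in for the Pipeline, and the file names are made up.

```python
import re
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 2 classes with 4 images each (made-up file names).
dirs = ["n02085620-Chihuahua", "n02110063-malamute"]
files = [f"Images/{d}/{d.split('-')[0]}_{i}.jpg" for d in dirs for i in range(4)]

# 1. Vocab of class ids taken from the directory names.
vocab = [d.split("-", maxsplit=1)[0] for d in dirs]

# 2. Object-to-index mapping (stand-in for CategoryMap's o2i).
o2i = {cls_id: i for i, cls_id in enumerate(vocab)}

# 3. Go from a full path to its integer label.
pat = re.compile(r"/(\w+)_\d+\.jpg$")

def path_to_index(p):
    return o2i[pat.search(p).group(1)]

X = np.array(files)
y = np.array([path_to_index(p) for p in files])

# 4. Stratified folds: each validation fold keeps the class balance.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=47)
for fold, (_, val_idx) in enumerate(skf.split(X, y)):
    print(fold, sorted(y[val_idx].tolist()))
```

With four images per class and four folds, every validation fold ends up with exactly one image of each class, which is the property stratification is meant to preserve.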
re.split(r'-',"n02098105-soft-coated_wheaten_terrier",maxsplit=1)
['n02098105', 'soft-coated_wheaten_terrier']
impath = path/'Images'
dirs = L(impath.ls()).map(attrgetter('name'))

# map ImageNet class id -> breed name, e.g. 'n02098105' -> 'soft-coated_wheaten_terrier'
id2label = {}
for d in dirs:
  k,v = re.split(r'-',d,maxsplit=1)
  id2label[k]=v

id2label["n02098105"]
'soft-coated_wheaten_terrier'
def get_lbl(o): return re.split(r'-',str(o),maxsplit=1)[0]
vocab = dirs.map(get_lbl); vocab
(#120) ['n02111277','n02097130','n02105251','n02110063','n02085936','n02115641','n02112018','n02099601','n02092002','n02098286'...]
catm = CategoryMap(vocab,sort=False) 
# to check the mapping
#print(catm.o2i)
X = array(train_fns)
# the dot before 'jpg' is escaped so it only matches a literal '.'
ypipe = Pipeline([RegexLabeller(r'/(\w+)_\d+\.jpg$'), catm.o2i.__getitem__])
y = array(L(train_fns).map(ypipe))
X[0], y[0], catm[y[0]]
(Path('data/stanford-dogs/Images/n02092002-Scottish_deerhound/n02092002_6114.jpg'),
 8,
 'n02092002')

ypipe maps a file path to a class index, while catm maps an index back to its label (the class ID).

Labelling

The regex captures the part of the path between the last '/' and the final '_', which is the ImageNet ID of the class. We can then use the id2label dictionary created earlier to map these IDs to breed names.

pat = re.compile(r"/(\w+)_\d+\.jpg$")
res = pat.search(str(train_fns[105]))
print(f"Path: {train_fns[105]}\nLabel: {res.group(1)}\nBreed: {id2label[res.group(1)]}")
Path: data/stanford-dogs/Images/n02110063-malamute/n02110063_16539.jpg
Label: n02110063
Breed: malamute
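
Since each image sits in a directory named with the ID, a hyphen, and the breed, the same two values can also be read from the parent directory name. This is an alternative sketch, not the approach this tutorial uses; id_and_breed is a hypothetical helper and the file name in the example is made up.

```python
from pathlib import Path

def id_and_breed(image_path):
    """Split the parent directory name '<id>-<breed>' at the first hyphen."""
    class_id, breed = Path(image_path).parent.name.split("-", maxsplit=1)
    return class_id, breed

p = "data/stanford-dogs/Images/n02098105-soft-coated_wheaten_terrier/n02098105_100.jpg"
print(id_and_breed(p))  # ('n02098105', 'soft-coated_wheaten_terrier')
```

Limiting the split to the first hyphen matters because some breed names contain hyphens themselves.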

A CSV file is a convenient way to store the details of each example, including its fold index.

labeller = RegexLabeller(r"/(\w+)_\d+\.jpg$")
lbl_pipe = Pipeline([labeller, id2label.__getitem__])
lbl_pipe(fnames[10])
'Newfoundland'
labels = L(train_fns).map(lbl_pipe)
class_ids = L(train_fns).map(labeller)

Now we have everything needed to create a csv file with Stratified k-folds.

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=4,shuffle=True,random_state=47)
df = pd.DataFrame({'image_path': train_fns, 
                   'class_id': list(class_ids),
                   'label': list(labels)})
# each row gets the index of the fold in which it serves as validation data
df['fold'] = -1
for i, (_, val_idx) in enumerate(skf.split(X,y)):
  df.loc[val_idx,'fold'] = i
df.head()
   image_path                                                                   class_id   label                 fold
0  data/stanford-dogs/Images/n02092002-Scottish_deerhound/n02092002_6114.jpg    n02092002  Scottish_deerhound    0
1  data/stanford-dogs/Images/n02092339-Weimaraner/n02092339_514.jpg             n02092339  Weimaraner            3
2  data/stanford-dogs/Images/n02096437-Dandie_Dinmont/n02096437_2267.jpg        n02096437  Dandie_Dinmont        3
3  data/stanford-dogs/Images/n02107683-Bernese_mountain_dog/n02107683_4016.jpg  n02107683  Bernese_mountain_dog  3
4  data/stanford-dogs/Images/n02111277-Newfoundland/n02111277_14330.jpg         n02111277  Newfoundland          2
df.to_csv('train.csv')
test_lbls = L(test_fns).map(lbl_pipe)
test_y = L(test_fns).map(ypipe)
test_cls_ids = L(test_fns).map(labeller)
test_df = pd.DataFrame({'image_path': test_fns,
                        'class_id': list(test_cls_ids),
                        'label': list(test_lbls),
                        'y': list(test_y)})
test_df.head()
   image_path                                                        class_id   label      y
0  data/stanford-dogs/Images/n02085620-Chihuahua/n02085620_2650.jpg  n02085620  Chihuahua  24
1  data/stanford-dogs/Images/n02085620-Chihuahua/n02085620_4919.jpg  n02085620  Chihuahua  24
2  data/stanford-dogs/Images/n02085620-Chihuahua/n02085620_1765.jpg  n02085620  Chihuahua  24
3  data/stanford-dogs/Images/n02085620-Chihuahua/n02085620_3006.jpg  n02085620  Chihuahua  24
4  data/stanford-dogs/Images/n02085620-Chihuahua/n02085620_1492.jpg  n02085620  Chihuahua  24
test_df.to_csv('test.csv')

At this point, you have two CSV files:

  1. train.csv: the 12,000 labelled training examples, each with a stratified fold index
  2. test.csv: the labelled test set, built from the provided test file list
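
To use the folds later, one fold can serve as the validation set while the others form the training set. Here is a minimal sketch with pandas on a tiny made-up frame; the column names match train.csv, and split_fold is a hypothetical helper.

```python
import pandas as pd

# Tiny stand-in for train.csv with the same columns (made-up rows).
df = pd.DataFrame({
    "image_path": [f"img_{i}.jpg" for i in range(8)],
    "class_id": ["n02085620"] * 4 + ["n02110063"] * 4,
    "label": ["Chihuahua"] * 4 + ["malamute"] * 4,
    "fold": [0, 1, 2, 3, 0, 1, 2, 3],
})

def split_fold(frame, fold):
    """Return (train, val) frames, using `fold` as the validation fold."""
    is_val = frame["fold"] == fold
    return frame[~is_val], frame[is_val]

train_df, val_df = split_fold(df, fold=0)
print(len(train_df), len(val_df))  # 6 2
print(val_df["label"].tolist())
```

When reading the real file back, pd.read_csv('train.csv', index_col=0) restores the index column that to_csv wrote by default.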