Download and set up dataset

I'll use the Stanford Dogs dataset from Kaggle. You need the Kaggle CLI installed and your kaggle.json file placed in the ~/.kaggle directory (/root/.kaggle on Colab).
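Before running the download, it can help to confirm that the credentials file is where the CLI looks for it. The sketch below is a hypothetical helper (`check_kaggle_credentials` is not part of the Kaggle CLI). It assumes the default ~/.kaggle location, the `username` and `key` entries found in a downloaded API token, and owner-only file permissions on a POSIX system.

```python
import json
import stat
from pathlib import Path

def check_kaggle_credentials(config_dir=None):
    """Return a list of problems found with kaggle.json (empty list means OK).

    Hypothetical helper for illustration; not part of the Kaggle CLI.
    """
    if config_dir is None:
        config_dir = Path.home() / ".kaggle"  # assumed default location
    cred = Path(config_dir) / "kaggle.json"
    if not cred.is_file():
        return [f"{cred} not found"]
    try:
        data = json.loads(cred.read_text())
    except json.JSONDecodeError:
        return [f"{cred} is not valid JSON"]
    problems = []
    # A downloaded API token holds these two entries.
    for key in ("username", "key"):
        if key not in data:
            problems.append(f"missing '{key}' entry")
    # Flag any group/other permission bits; `chmod 600` clears them.
    if cred.stat().st_mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append("file is accessible to other users (consider chmod 600)")
    return problems
```

Calling `check_kaggle_credentials()` with no argument inspects the default directory; on Colab you could pass `'/root/.kaggle'` explicitly.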

!kaggle datasets download -d jessicali9530/stanford-dogs-dataset -p data
!cd data && mkdir stanford-dogs &&\
 unzip -qq stanford-dogs-dataset.zip -d stanford-dogs &&\
 rm -rf stanford-dogs-dataset.zip

Downloading stanford-dogs-dataset.zip to data
 98% 737M/750M [00:09<00:00, 79.2MB/s]
100% 750M/750M [00:09<00:00, 80.3MB/s]

%cd data
!mv stanford-dogs/annotations/Annotation stanford-dogs/images/Images stanford-dogs
!rm -rf stanford-dogs/annotations stanford-dogs/images
%cd /content

!wget http://vision.stanford.edu/aditya86/ImageNetDogs/lists.tar 
!tar -xf lists.tar test_list.mat && rm -rf lists.tar

Explore Dataset

The goal of this section is to work out how to label the images with breed names, split the dataset into train and test sets, and apply stratified K-fold to the train set.

Expected distribution at the end:

Train: 9000

Val: 3000

Test: 8580
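The train and validation figures follow from simple arithmetic. A minimal sketch, assuming the 12,000 training images and the 4-fold split used later in this tutorial (`fold_sizes` is a made-up helper name):

```python
def fold_sizes(n_train_total, n_splits):
    """Sizes of the (train, validation) parts when one of n_splits folds is held out."""
    val = n_train_total // n_splits
    return n_train_total - val, val

# Assumed inputs: 12,000 images outside the test list, split into 4 folds.
train, val = fold_sizes(12_000, 4)
print(train, val)  # 9000 3000
```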

from fastai2.vision.all import *
from fastai2.data.all import *
from fastprogress import progress_bar

from scipy.io import loadmat
test_list = loadmat('test_list.mat')

base = Path('data/stanford-dogs/Images')
# Each entry of 'file_list' is a nested array wrapping one relative path string
test_fns = [base/l[0][0] for l in test_list['file_list']]
path = Path('data/stanford-dogs')
fnames = get_image_files(path/'Images'); fnames[:5]

The test images are kept aside for benchmarking; the remaining images will be split into train and validation sets.

train_fns = list(set(fnames) - set(test_fns))
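One caveat with the set difference: a Python set has no guaranteed iteration order, so the order of train_fns can change between sessions, and with it the row order and fold assignments built later. A minimal sketch of an order-stable variant, using made-up file names and a hypothetical `hold_out` helper:

```python
from pathlib import Path

def hold_out(all_files, test_files):
    """Return the files not in test_files, sorted so the order is reproducible."""
    return sorted(set(all_files) - set(test_files))

# Made-up file names for illustration only.
all_files = [Path(f"Images/n0001-breed_a/n0001_{i}.jpg") for i in range(5)]
test_files = all_files[1:3]
train_files = hold_out(all_files, test_files)

# Sanity checks: no test file leaks into train, and nothing is lost.
assert not set(train_files) & set(test_files)
assert len(train_files) + len(test_files) == len(all_files)
print([p.name for p in train_files])
```

Sorting costs little on a dataset of this size and makes a fixed `random_state` meaningful across runs.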

Stratified KFold

The following steps are performed:

From the directory names, we build a vocab of the classes involved
We pass this vocab to CategoryMap to get the o2i (object-to-index) mapping
We write a Pipeline that goes from a full path to its integer label
We proceed with StratifiedKFold, where X holds the file names and y the respective integer labels
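The steps above can be sketched in plain Python, with a dict standing in for CategoryMap and a small function standing in for the Pipeline. The directory and file names are made up to follow the dataset's naming scheme, and `path_to_index` is a hypothetical name.

```python
import re
from pathlib import PurePosixPath

# Made-up directory and file names that follow the dataset's naming scheme.
dirs = ["n02110063-malamute", "n02092002-Scottish_deerhound"]
files = [PurePosixPath("Images/n02092002-Scottish_deerhound/n02092002_6114.jpg"),
         PurePosixPath("Images/n02110063-malamute/n02110063_16539.jpg")]

# 1. Vocab: the class id is everything before the first '-' in the directory name.
vocab = [d.split("-", 1)[0] for d in dirs]

# 2. o2i: map each class id to its position in the vocab (the role CategoryMap plays).
o2i = {cls: i for i, cls in enumerate(vocab)}

# 3. Pipeline stand-in: full path -> class id -> integer label.
pat = re.compile(r"/(\w+)_\d+\.jpg$")

def path_to_index(p):
    return o2i[pat.search(p.as_posix()).group(1)]

# 4. X and y in the form they would be handed to StratifiedKFold.
X = files
y = [path_to_index(p) for p in files]
print(y)  # [1, 0]
```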

re.split(r'-',"n02098105-soft-coated_wheaten_terrier",maxsplit=1)

['n02098105', 'soft-coated_wheaten_terrier']

impath = path/'Images'
dirs = L(impath.ls()).map(attrgetter('name'))

# Map ImageNet class id -> breed name, e.g. 'n02098105' -> 'soft-coated_wheaten_terrier'
id2label = {}
for d in dirs:
  k,v = re.split(r'-',d,maxsplit=1)
  id2label[k]=v

id2label["n02098105"]

'soft-coated_wheaten_terrier'

def get_lbl(o): return re.split(r'-',str(o),maxsplit=1)[0]
vocab = dirs.map(get_lbl); vocab

(#120) ['n02111277','n02097130','n02105251','n02110063','n02085936','n02115641','n02112018','n02099601','n02092002','n02098286'...]

catm = CategoryMap(vocab,sort=False) 
# to check the mapping
#print(catm.o2i)

X = array(train_fns)
ypipe = Pipeline([RegexLabeller(r'/(\w+)_\d+\.jpg$'), catm.o2i.__getitem__])
y = array(L(train_fns).map(ypipe))
X[0], y[0], catm[y[0]]

(Path('data/stanford-dogs/Images/n02092002-Scottish_deerhound/n02092002_6114.jpg'),
 8,
 'n02092002')

ypipe transforms a file path into an index, while catm maps an index back to its label (class_id).

Labelling

We capture the part between the last '/' and the final '_' of the path. This gives the ImageNet id of the class, and the id2label dict created earlier maps that id to a breed name.
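As a rough standalone illustration of this lookup, the sketch below wraps the regex in a hypothetical `breed_from_path` helper that fails loudly on file names that do not fit the pattern. The one-entry id2label dict is a stand-in for the full mapping.

```python
import re
from pathlib import PurePosixPath

# The dot is escaped so it only matches a literal '.' before the extension.
PAT = re.compile(r"/(\w+)_\d+\.jpg$")

def breed_from_path(path, id2label):
    """Map an image path to (class_id, breed); hypothetical helper for illustration."""
    m = PAT.search(PurePosixPath(path).as_posix())
    if m is None:
        raise ValueError(f"unexpected file name: {path}")
    class_id = m.group(1)
    return class_id, id2label[class_id]

# Stand-in for the full id2label mapping.
id2label = {"n02110063": "malamute"}
result = breed_from_path(
    "data/stanford-dogs/Images/n02110063-malamute/n02110063_16539.jpg", id2label)
print(result)  # ('n02110063', 'malamute')
```

Because `\w` matches neither '-' nor '/', the directory name can never satisfy the pattern, so the match always comes from the file name itself.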

pat = re.compile(r"/(\w+)_\d+\.jpg$")
res = pat.search(str(train_fns[105]))
print(f"Path: {train_fns[105]}\nLabel: {res.group(1)}\nBreed: {id2label[res.group(1)]}")

Path: data/stanford-dogs/Images/n02110063-malamute/n02110063_16539.jpg
Label: n02110063
Breed: malamute

A CSV file is a convenient way to record the details of each example, including its fold index.

labeller = RegexLabeller(r"/(\w+)_\d+\.jpg$")
lbl_pipe = Pipeline([labeller, id2label.__getitem__])
lbl_pipe(fnames[10])

'Newfoundland'

labels = L(train_fns).map(lbl_pipe)
class_ids = L(train_fns).map(labeller)

Now we have everything needed to create a csv file with Stratified k-folds.

from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=4,shuffle=True,random_state=47)
df = pd.DataFrame({'image_path': train_fns, 
                   'class_id': list(class_ids),
                   'label': list(labels)})
df['fold'] = -1
# Each split's validation indices define one fold, so every row gets exactly one fold id
for i, (_, val_idx) in enumerate(skf.split(X,y)):
  df.loc[val_idx,'fold'] = i
df.head()

df.to_csv('train.csv')
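To see what stratification buys, the sketch below assigns folds with a simplified scheme: shuffle the indices of each class, then deal them out round-robin. This is a stand-in for illustration, not scikit-learn's implementation, and both the labels and the `stratified_folds` name are made up.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits, seed=0):
    """Assign a fold index to every sample so each class is spread evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lbl in enumerate(labels):
        by_class[lbl].append(idx)
    folds = [-1] * len(labels)
    for lbl in sorted(by_class):
        idxs = by_class[lbl]
        rng.shuffle(idxs)                 # randomise within the class
        for pos, idx in enumerate(idxs):  # deal out round-robin
            folds[idx] = pos % n_splits
    return folds

# Synthetic labels: 3 classes with 8 samples each.
labels = [c for c in range(3) for _ in range(8)]
folds = stratified_folds(labels, n_splits=4)

# Count samples per (fold, class) pair; stratification keeps these equal.
per_fold = Counter(zip(folds, labels))
print(sorted(per_fold.items()))
```

On the real data, `pd.crosstab(df['fold'], df['class_id'])` gives the same kind of per-fold class count for checking the split.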

test_lbls = L(test_fns).map(lbl_pipe)
test_y = L(test_fns).map(ypipe)
test_cls_ids = L(test_fns).map(labeller)
test_df = pd.DataFrame({'image_path': test_fns,
                        'class_id': list(test_cls_ids),
                        'label': list(test_lbls),
                        'y': list(test_y)})
test_df.head()

test_df.to_csv('test.csv')

At this point, you have two CSV files:

train.csv: the 12,000 training examples, labelled and assigned to stratified K-folds
test.csv: the test set built from the given test file names

First rows of train.csv (output of df.head()):

	image_path	class_id	label	fold
0	data/stanford-dogs/Images/n02092002-Scottish_deerhound/n02092002_6114.jpg	n02092002	Scottish_deerhound	0
1	data/stanford-dogs/Images/n02092339-Weimaraner/n02092339_514.jpg	n02092339	Weimaraner	3
2	data/stanford-dogs/Images/n02096437-Dandie_Dinmont/n02096437_2267.jpg	n02096437	Dandie_Dinmont	3
3	data/stanford-dogs/Images/n02107683-Bernese_mountain_dog/n02107683_4016.jpg	n02107683	Bernese_mountain_dog	3
4	data/stanford-dogs/Images/n02111277-Newfoundland/n02111277_14330.jpg	n02111277	Newfoundland	2

First rows of test.csv (output of test_df.head()):

	image_path	class_id	label	y
0	data/stanford-dogs/Images/n02085620-Chihuahua/n02085620_2650.jpg	n02085620	Chihuahua	24
1	data/stanford-dogs/Images/n02085620-Chihuahua/n02085620_4919.jpg	n02085620	Chihuahua	24
2	data/stanford-dogs/Images/n02085620-Chihuahua/n02085620_1765.jpg	n02085620	Chihuahua	24
3	data/stanford-dogs/Images/n02085620-Chihuahua/n02085620_3006.jpg	n02085620	Chihuahua	24
4	data/stanford-dogs/Images/n02085620-Chihuahua/n02085620_1492.jpg	n02085620	Chihuahua	24
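To use the folds later, pick one fold id as the validation set and train on the rest. The sketch below shows the idea with the standard csv module on a two-row stand-in for train.csv. The leading empty header is the index column that `DataFrame.to_csv` writes when `index=False` is not passed, and `split_by_fold` is a hypothetical helper.

```python
import csv
import io

# A two-row stand-in for train.csv, including the unnamed index column.
sample = io.StringIO(
    ",image_path,class_id,label,fold\n"
    "0,Images/n02092002-Scottish_deerhound/n02092002_6114.jpg,n02092002,Scottish_deerhound,0\n"
    "1,Images/n02092339-Weimaraner/n02092339_514.jpg,n02092339,Weimaraner,3\n"
)

def split_by_fold(rows, val_fold):
    """Split rows into (train, val) using the integer stored in the 'fold' column."""
    train, val = [], []
    for row in rows:
        (val if int(row["fold"]) == val_fold else train).append(row)
    return train, val

rows = list(csv.DictReader(sample))
train_rows, val_rows = split_by_fold(rows, val_fold=0)
print(len(train_rows), len(val_rows))  # 1 1
```

In pandas the same selection is `df[df.fold != k]` and `df[df.fold == k]`, and reading with `pd.read_csv('train.csv', index_col=0)` restores the saved index.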

Tutorial - Data Wrangling with fastai2
