Download and set up dataset
I'll be using a Kaggle dataset for this. You should have the Kaggle CLI installed and your kaggle.json file placed in the ~/.kaggle directory (or /root/.kaggle if on Colab).
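If you haven't set up the credentials yet, the steps look roughly like this (assuming kaggle.json has been downloaded to your current directory; adjust paths as needed on Colab):

```shell
pip install kaggle                 # Kaggle CLI
mkdir -p ~/.kaggle
cp kaggle.json ~/.kaggle/          # API token from your Kaggle account page
chmod 600 ~/.kaggle/kaggle.json    # the CLI warns if permissions are wider
```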
!kaggle datasets download -d jessicali9530/stanford-dogs-dataset -p data
!cd data && mkdir stanford-dogs &&\
unzip -qq stanford-dogs-dataset.zip -d stanford-dogs &&\
rm -rf stanford-dogs-dataset.zip
%cd data
!mv stanford-dogs/annotations/Annotation stanford-dogs/images/Images stanford-dogs
!rm -rf stanford-dogs/annotations stanford-dogs/images
%cd /content
!wget http://vision.stanford.edu/aditya86/ImageNetDogs/lists.tar
!tar -xf lists.tar test_list.mat && rm -rf lists.tar
from fastai2.vision.all import *
from fastai2.data.all import *
from fastprogress import progress_bar
from scipy.io import loadmat
test_list = loadmat('test_list.mat');
base = Path('data/stanford-dogs/Images')
test_fns =[base/l[0][0] for l in test_list['file_list']]
path = Path('data/stanford-dogs')
fnames = get_image_files(path/'Images'); fnames[:5]
Keeping the test images aside for benchmarking, we'll split the train set into train/val folds.
train_fns = list(set(fnames) - set(test_fns))
The following steps are performed:
- From all the directory names, we build a vocab of the classes involved
- We pass this vocab to CategoryMap to get the o2i mapping
- A Pipeline is written to go from a full path to its label (integer)
- We proceed with StratifiedKFold, where X is the file names and y is the respective integer label
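Before reaching for the fastai2 helpers, here's a minimal pure-Python sketch of the same idea, with a couple of toy directory names standing in for the real class folders (the names and paths below are illustrative, not from the dataset listing):

```python
import re

# Toy directory names standing in for the Stanford Dogs class folders
dirs = ["n02085620-Chihuahua", "n02085782-Japanese_spaniel"]

# Build the vocab of class ids (the part before the first '-')
vocab = [re.split(r'-', d, maxsplit=1)[0] for d in dirs]

# o2i: class id -> integer label (what fastai2's CategoryMap provides)
o2i = {o: i for i, o in enumerate(vocab)}

def label_of(path):
    """Pipeline: full path -> class id -> integer label.
    Assumes the path matches the 'id_number.jpg' pattern."""
    class_id = re.search(r'/(\w+)_\d+\.jpg$', path).group(1)
    return o2i[class_id]

print(label_of("Images/n02085782-Japanese_spaniel/n02085782_50.jpg"))  # -> 1
```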
re.split(r'-',"n02098105-soft-coated_wheaten_terrier",maxsplit=1)
impath = path/'Images';
dirs = L(impath.ls()).map(attrgetter('name'));
id2label = {}
for d in dirs:
k,v = re.split(r'-',d,maxsplit=1)
id2label[k]=v
id2label["n02098105"]
def get_lbl(o): return re.split(r'-',str(o),maxsplit=1)[0]
vocab = dirs.map(get_lbl); vocab
catm = CategoryMap(vocab,sort=False)
# to check the mapping
#print(catm.o2i)
X = array(train_fns)
ypipe = Pipeline([RegexLabeller(r'/(\w+)_\d+\.jpg$'), catm.o2i.__getitem__])
y = array(L(train_fns).map(ypipe))
X[0], y[0], catm[y[0]]
ypipe transforms a file path to an integer index, while catm transforms indices back to labels (class_id).
We'll capture the part between '/' and '_' (at the end of the string). This gives us the ImageNet id for that class; we can then use the id2label mapping created earlier to map those ids to breed names.
pat = re.compile(r"/(\w+)_\d+\.jpg$")
res = pat.search(str(train_fns[105]))
print(f"Path: {train_fns[105]}\nLabel: {res.group(1)}\nBreed: {id2label[res.group(1)]}")
A CSV would be a great way to represent all the details associated with each example, including its fold index.
labeller = RegexLabeller(r"/(\w+)_\d+\.jpg$")
lbl_pipe = Pipeline([labeller, id2label.__getitem__])
lbl_pipe(fnames[10])
labels = L(train_fns).map(lbl_pipe)
class_ids = L(train_fns).map(labeller)
Now we have everything needed to create a csv file with Stratified k-folds.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=4,shuffle=True,random_state=47)
df = pd.DataFrame({'image_path': train_fns,
'class_id': list(class_ids),
'label': list(labels)})
df['fold'] = -1
for i, (_, val_idx) in enumerate(skf.split(X,y)):
df.loc[val_idx,'fold'] = i
df.head()
df.to_csv('train.csv', index=False)
test_lbls = L(test_fns).map(lbl_pipe)
test_y = L(test_fns).map(ypipe)
test_cls_ids = L(test_fns).map(labeller)
test_df = pd.DataFrame({'image_path': test_fns,
'class_id': list(test_cls_ids),
'label': list(test_lbls),
'y': list(test_y)})
test_df.head()
test_df.to_csv('test.csv', index=False)
At this point, you have two CSV files:
- train.csv: the training examples (12000) with labels and stratified k-fold assignments
- test.csv: the test dataset built from the given test file names
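When it's time to train, a given fold's split can be recovered from the CSV. A sketch with a toy frame (with the real file you'd start from pd.read_csv('train.csv') instead):

```python
import pandas as pd

# Toy stand-in for train.csv; the real one has 12000 rows and 4 folds
df = pd.DataFrame({
    'image_path': ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'],
    'label':      ['x', 'y', 'x', 'y'],
    'fold':       [0, 1, 0, 1],
})

fold = 0                             # which fold acts as validation
train_df = df[df.fold != fold]       # everything outside the fold
valid_df = df[df.fold == fold]       # the held-out fold
print(len(train_df), len(valid_df))  # -> 2 2
```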