In this dissertation, we investigate the problem of large-scale, fine-grained image recognition, which requires distinguishing subtle differences among subordinate classes across a large number of images. Specifically, we tackle this problem by answering three interrelated questions: 1) how to learn robust and invariant feature representations that capture the subtle, fine-grained differences among subordinate classes, 2) how to index these features for efficient image analysis (e.g., classification and content-based retrieval) at a large scale, and 3) how to fuse different types of features to obtain better results. We propose a series of methods to address these three problems. For feature representation learning, we design a convolutional neural network (CNN) architecture that unifies the classification constraint and the similarity constraint in a multi-task learning framework. Structured labels are also embedded in this framework, so that image similarity can be defined at multiple levels of relevance (e.g., by the number of shared attributes) in the learned feature space. For feature indexing, we propose several methods based on hashing and binary coding, which enable real-time image retrieval and classification even with high-dimensional features and/or very large feature collections. For feature fusion, we employ a graph-based, query-specific fusion approach in which multiple retrieval results (i.e., ranked lists) are integrated and reordered on a fused graph. We evaluate these methods on both natural and medical images, as we advocate that medical image recognition (e.g., cancer grading from histopathology images) demands ultra-fine-grained differentiation. The experimental results demonstrate the efficacy of our methods in terms of both accuracy and efficiency.
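To make the first component concrete, the unified objective can be sketched as a weighted sum of a classification loss and a similarity loss. The triplet-style similarity term, the weight $\lambda$, and the margin $m$ below are illustrative assumptions for exposition, not the exact formulation developed in the dissertation:

```latex
% Illustrative multi-task objective: classification plus similarity.
% The triplet form, \lambda, and margin m are assumptions for exposition.
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{cls}}
\;+\; \lambda \sum_{(a,\,p,\,n)}
\max\!\bigl(0,\; \|f(a)-f(p)\|_2^2 \;-\; \|f(a)-f(n)\|_2^2 \;+\; m\bigr)
```

where $f(\cdot)$ denotes the CNN embedding and $(a, p, n)$ ranges over anchor-positive-negative triplets; positives may be drawn from the same class or, with structured labels, from images sharing a given number of attributes.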
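The indexing component relies on compact binary codes compared in Hamming space. The following is a minimal, self-contained sketch assuming random-projection (LSH-style) hashing; the dissertation's learned hashing functions differ, and the function names here (`learn_projection`, `encode`, `hamming_search`) are hypothetical:

```python
import numpy as np

def learn_projection(dim: int, n_bits: int, seed: int = 0) -> np.ndarray:
    """Random hyperplanes for LSH-style binary coding (illustrative only)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((dim, n_bits))

def encode(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Map real-valued features to {0, 1} codes via the sign of projections."""
    return (features @ W > 0).astype(np.uint8)

def hamming_search(query_code: np.ndarray, db_codes: np.ndarray, k: int = 10):
    """Return indices of the k database codes closest in Hamming distance."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)[:k]

# Usage: index 100k 512-D features with 64-bit codes, then query.
rng = np.random.default_rng(1)
db = rng.standard_normal((100_000, 512), dtype=np.float32)
W = learn_projection(512, 64)
codes = encode(db, W)
query = rng.standard_normal(512, dtype=np.float32)
top = hamming_search(encode(query[None, :], W)[0], codes)
```

Because Hamming distances reduce to bitwise XOR and popcount operations, such codes support real-time search over millions of items.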
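For the fusion component, a common graph-based, query-specific recipe builds one graph per ranked list from k-reciprocal nearest neighbors with Jaccard edge weights, sums the graphs, and reorders candidates by a PageRank-style link analysis. The sketch below follows that generic recipe under these stated assumptions; it is not the dissertation's implementation:

```python
import numpy as np

def reciprocal_graph(neighbors: dict, k: int) -> dict:
    """Edge (i, j) with Jaccard weight if i and j are k-reciprocal neighbors.

    `neighbors[i]` is the rank list (nearest first) for item i under one
    feature type or retrieval method. Illustrative construction only.
    """
    graph = {}
    for i, nbrs in neighbors.items():
        for j in nbrs[:k]:
            if i in neighbors.get(j, [])[:k]:            # reciprocal check
                a, b = set(nbrs[:k]), set(neighbors[j][:k])
                graph[(i, j)] = len(a & b) / len(a | b)  # Jaccard weight
    return graph

def fuse_and_rank(graphs: list, damping: float = 0.85, iters: int = 50) -> list:
    """Sum edge weights across graphs, then rank nodes by PageRank scores."""
    fused = {}
    for g in graphs:
        for e, w in g.items():
            fused[e] = fused.get(e, 0.0) + w
    nodes = sorted({n for e in fused for n in e})
    idx = {n: t for t, n in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for (i, j), w in fused.items():
        A[idx[i], idx[j]] = w
        A[idx[j], idx[i]] = w
    col = A.sum(axis=0, keepdims=True)      # column-normalize the adjacency
    col[col == 0] = 1.0
    P = A / col
    r = np.full(len(nodes), 1.0 / len(nodes))
    for _ in range(iters):                  # power iteration with damping
        r = (1 - damping) / len(nodes) + damping * (P @ r)
    return [nodes[t] for t in np.argsort(-r)]
```

In this setup, one graph is built per feature type or retrieval method, so the fused ranking benefits from whichever cue is most reliable for the given query.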