K-means clustering is one of the most prominent clustering methods and is used in many applications. Given how widely k-means is applied, redesigning it for high-performance computing can have a considerable impact. In this paper, we focus on scalability and exploit the available resources at different levels of parallelism. We propose novel techniques for different hardware platforms and evaluate each of them separately on uniformly random generated datasets of different sizes. We change the classic two-stage Lloyd’s formulation into a three-stage one and apply different techniques to each stage. In addition, we use an algebraic technique to reduce the amount of computation, which lays the foundation for the subsequent ideas. On CPUs, we propose a parallel architecture based on OpenMP and the AVX2 instruction set. On GPUs, we use atomic operations and shared memory without depending on particular GPU memory or shared memory capabilities. The proposed method extends to multiple GPUs. Finally, we merge these techniques and use MPI to scale the method to multi-node platforms.
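For orientation, the classic two-stage Lloyd iteration (assignment, then centroid update) can be sketched as follows. This is a minimal serial sketch and not the proposed three-stage method. The abstract does not name the algebraic technique it uses; the sketch assumes the standard expansion ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, in which the ||x||^2 term can be dropped because it does not affect which centroid is nearest. The function names `assign`, `update`, and `lloyd` are hypothetical.

```python
def assign(points, centroids):
    """Assignment stage: index of the nearest centroid for each point."""
    # ||c||^2 is computed once per centroid, not once per (point, centroid) pair.
    c_norms = [sum(v * v for v in c) for c in centroids]
    labels = []
    for x in points:
        # Assumed identity: ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.
        # ||x||^2 is identical for every centroid, so it is omitted.
        scores = [cn - 2.0 * sum(a * b for a, b in zip(x, c))
                  for c, cn in zip(centroids, c_norms)]
        labels.append(min(range(len(centroids)), key=scores.__getitem__))
    return labels


def update(points, labels, centroids):
    """Update stage: each centroid becomes the mean of its assigned points."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for x, j in zip(points, labels):
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += x[d]
    # An empty cluster keeps its previous centroid (a choice made for this sketch).
    return [[s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
            for j in range(len(centroids))]


def lloyd(points, centroids, max_iter=100):
    """Alternate the two stages until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return centroids, labels
```

As a usage example, `lloyd([[0, 0], [0, 1], [10, 10], [10, 11]], [[0, 0], [10, 10]])` groups the first two points into one cluster and the last two into the other. The accumulation loop in `update` is the kind of reduction that a parallel version would have to coordinate across threads, for instance with atomic operations on a GPU.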